Natural Language Processing

NLP

Assignment 1
1) Define natural language processing. List and explain different NLP applications.

Ans|

Define Natural Language Processing (NLP)


Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the
interaction between computers and humans through natural language. It involves the
application of computational techniques to the analysis and synthesis of natural language and
speech. The goal of NLP is to enable computers to understand, interpret, and generate human
language in a useful way. This includes tasks such as machine translation, sentiment analysis,
and speech recognition, among others.

Different NLP Applications

Email Platforms
Explanation: Email platforms like Gmail and Outlook use NLP to offer features such as
spam classification, priority inbox sorting, calendar event extraction, and auto-complete.
Example: Gmail automatically categorizes incoming emails into primary, social,
promotions, etc., and filters out spam.

Voice-Based Assistants

Explanation: Assistants like Apple Siri, Google Assistant, Microsoft Cortana, and Amazon
Alexa rely on NLP to understand user commands and provide appropriate responses.
Example: Siri setting a reminder or answering a question about the weather.

Search Engines
Explanation: Search engines like Google and Bing use NLP for query understanding, query
expansion, question answering, information retrieval, and result ranking and grouping.
Example: Google interpreting a search query and providing the most relevant results.

Machine Translation Services


Explanation: Services like Google Translate and Amazon Translate use NLP to translate
text from one language to another, catering to a variety of business and personal use cases.

Example: Translating a document from English to Spanish using Google Translate.

Social Media Analysis


Explanation: Organizations analyze social media feeds using NLP to understand customer
opinions and sentiments better.
Example: A company analyzing Twitter mentions to gauge public sentiment about a new
product launch.

E-commerce Platforms
Explanation: E-commerce platforms like Amazon use NLP to extract information from
product descriptions and understand user reviews.
Example: Analyzing customer reviews to improve product recommendations.

Specialized Domains (Healthcare, Finance, Law)


Explanation: NLP is applied in domains like healthcare, finance, and law to solve specific
industry problems.
Example: Using NLP to analyze patient records and assist in diagnosis.

Automated Report Generation


Explanation: Companies like Arria use NLP to generate reports automatically for various
domains such as weather forecasting and financial services.
Example: Generating a financial summary report based on data inputs.

Spelling and Grammar Correction Tools


Explanation: Tools like Grammarly and the spell check features in Microsoft Word and
Google Docs use NLP to correct spelling and grammar.
Example: Grammarly suggesting corrections to grammatical errors in a document.

Jeopardy! and AI Competitions


Explanation: IBM’s Watson AI, which uses NLP techniques, competed in and won the TV
quiz show Jeopardy!, showcasing the capabilities of NLP in understanding and responding to
complex queries.
Example: Watson interpreting and answering questions in Jeopardy!.

Learning and Assessment Tools


Explanation: NLP is used in automated exam scoring (e.g., GRE), plagiarism detection

(e.g., Turnitin), intelligent tutoring systems, and language learning apps like Duolingo.
Example: Turnitin detecting plagiarism in student essays.

Knowledge Bases
Explanation: Large knowledge bases like the Google Knowledge Graph are built using NLP
and are used in applications like search and question answering.
Example: Google’s Knowledge Graph providing detailed information about entities directly
in search results.

2) Explain various NLP tasks and draw diagram to represent the relative difficulty
among the task.

Ans|

NLP tasks:

Language Modeling
Explanation: Language modeling involves predicting the next word in a sentence based on
the history of previous words. The goal is to learn the probability of sequences of words in a
given language.
Applications: This task is foundational for many problems, such as speech recognition,
optical character recognition, handwriting recognition, machine translation, and spelling
correction.

Text Classification
Explanation: Text classification is the task of categorizing text into predefined categories
based on its content. It is one of the most popular tasks in NLP.
Applications: This is used in various tools, including email spam detection, sentiment
analysis, and topic classification.

Information Extraction
Explanation: Information extraction involves identifying and extracting relevant information
from text, such as names, dates, and specific facts.
Applications: Extracting calendar events from emails or identifying names of people
mentioned in social media posts.

Information Retrieval
Explanation: This task involves finding documents relevant to a user query from a large
collection.
Applications: Google Search is a well-known example of an information retrieval
application.

Conversational Agent
Explanation: Building dialogue systems that can converse with users in natural language.
Applications: Voice assistants like Alexa, Siri, Google Assistant, and chatbots.

Text Summarization
Explanation: The task of creating concise summaries of longer documents while retaining
the core content and meaning.
Applications: Summarizing news articles, research papers, or lengthy reports.

Question Answering
Explanation: Developing systems that can automatically answer questions posed in natural
language.
Applications: Virtual assistants and customer support bots that provide precise answers to
user queries.

Machine Translation
Explanation: The task of translating text from one language to another.
Applications: Services like Google Translate and Amazon Translate.

Topic Modeling
Explanation: Uncovering the underlying topical structure of a large collection of documents.
Applications: Used in text mining to identify themes in literature, research papers, or large
text datasets.

3) What is a language? List and explain the building blocks of language with examples
and applications.

Ans|

Language is a structured system of communication that involves complex combinations of its


constituent components, such as characters, words, sentences, etc. Linguistics is the
systematic study of language. In NLP, understanding these linguistic concepts is crucial for
developing effective language processing applications.

Building Blocks of Language


Language can be thought of as composed of four major building blocks: phonemes,
morphemes and lexemes, syntax, and context. Each of these plays a critical role in NLP
applications.

Phonemes

Definition: Phonemes are the smallest units of sound in a language. They may not have
meaning individually but can form meaningful units when combined.
Example: In English, the word "cat" is composed of the phonemes /k/, /æ/, and /t/.

Applications: Phonemes are essential in applications involving speech understanding, such


as speech recognition, speech-to-text transcription, and text-to-speech conversion.

Morphemes and Lexemes

Definition:
Morphemes: The smallest units of language that carry meaning. They can be words,
prefixes, or suffixes.
Lexemes: Variations of morphemes that are related by meaning.
Examples:
Morphemes: In the word "unbreakable," "un-" is a prefix morpheme, "break" is the root
morpheme, and "-able" is a suffix morpheme.
Lexemes: "Run" and "running" are different forms of the same lexeme.
Applications: Morphological analysis, tokenization, stemming, learning word embeddings,
and part-of-speech tagging.

Syntax

Definition: Syntax refers to the rules for constructing grammatically correct sentences from
words and phrases in a language. It represents the structure of sentences.
Example: The sentence "The girl laughed at the monkey" can be represented as a parse tree

with a noun phrase (NP) "The girl" and a verb phrase (VP) "laughed at the monkey."
Applications: Parsing (automatically constructing parse trees), entity extraction, and relation
extraction.

Context:
Definition: Context is the information that surrounds language and helps convey meaning. It
includes semantics (literal meaning) and pragmatics (meaning derived from world knowledge
and external context).
Examples:
Semantics: The sentence "The bank is closed" refers directly to the closure of a financial
institution.
Pragmatics: In the sentence "The bank is closed," pragmatics could reveal whether it is a
financial bank or the side of a river based on the conversation context.
Applications: Complex NLP tasks such as sarcasm detection, summarization, and topic
modeling.

Applications of the Building Blocks


Phonemes:
Speech Recognition: Identifying spoken words and converting them to text.
Text-to-Speech: Converting written text into spoken words.
Morphemes and Lexemes:
Tokenization: Breaking text into meaningful units.
Stemming: Reducing words to their root forms.
Part-of-Speech Tagging: Identifying the grammatical roles of words in a sentence.
Syntax:
Parsing: Constructing grammatical structures from sentences.
Entity Extraction: Identifying entities like names, dates, and places in text.
Context:
Sarcasm Detection: Understanding implied meanings that contradict the literal words.
Summarization: Condensing long texts while maintaining core information.
Topic Modeling: Identifying themes within a large set of documents.

4) What makes NLP a challenging or difficult problem domain?

Ans|

NLP, or Natural Language Processing, is challenging due to the inherent complexity and
nuances of human language. Below are some key factors that contribute to the difficulty of
NLP:

Ambiguity
Definition: Ambiguity means uncertainty of meaning. Human languages are inherently
ambiguous, allowing for multiple interpretations of the same text.
Examples:
The sentence “I made her duck” can mean either:
I cooked a duck for her.
I caused her to bend down to avoid an object.
Figurative language, such as idioms, further increases ambiguity. For instance, “He is as
good as John Doe” depends on the unspecified quality of John Doe.
Challenges: Disambiguating such sentences requires understanding the context, which is
often beyond the capabilities of most NLP systems.

Common Knowledge
Definition: Common knowledge refers to the set of facts that most humans are aware of and
use implicitly in conversation.
Examples:
The sentences “man bit dog” and “dog bit man” are syntactically similar, but humans
understand that the former is unlikely while the latter is possible due to common knowledge
about human and dog behavior.
Challenges: Encoding common knowledge into computational models is difficult because it
involves a vast and often implicit understanding of the world that humans take for granted.

Creativity
Definition: Language involves not just rules but also creativity, manifesting in various styles,
dialects, genres, and variations.
Examples:
Poetry, metaphors, and idiomatic expressions showcase the creative use of language.
Challenges: Understanding and processing creative aspects of language is a hard problem in
AI, as it requires going beyond rule-based approaches to capture the nuances and innovation
in human expression.

Diversity Across Languages


Definition: Languages vary greatly, with no direct mapping between the vocabularies and
structures of any two languages.
Examples:
An NLP solution developed for English may not work for Chinese, Hindi, or Arabic due to
differences in syntax, semantics, and idiomatic usage.
Challenges: Developing language-agnostic solutions is conceptually difficult, while building
separate solutions for each language is laborious and time-intensive.

5) With a neat diagram explain how NLP, ML and Deep Learning are related?

Ans|

Artificial Intelligence (AI): The outermost layer encompasses all intelligent systems,
including those based on traditional logic and heuristic approaches as well as learning-based
systems.

Machine Learning (ML): Within AI, ML focuses on systems that learn from data. It
overlaps with other AI approaches but represents a distinct methodology centered on data-
driven learning.

Deep Learning (DL): A subset of ML, DL uses multi-layered neural networks to handle
complex data patterns and tasks. It is a powerful tool within the ML toolkit, particularly for
high-dimensional data.

Natural Language Processing (NLP): NLP sits within AI and intersects significantly with
ML and DL. Traditional NLP methods often relied on rule-based systems, but modern NLP
increasingly leverages ML and DL techniques to achieve more sophisticated language
understanding and generation.

Relationships and Interactions



Traditional AI and Early NLP:

Initially, AI and NLP used rule-based and heuristic methods to process language. These
methods included handcrafted rules to parse and understand language.
ML and NLP:

Over the past few decades, NLP has increasingly adopted ML techniques. These involve
training models on large datasets to perform tasks such as text classification, sentiment
analysis, and named entity recognition.
DL and NLP:

In recent years, DL has revolutionized NLP. Deep learning models, especially those based on
neural networks, have achieved state-of-the-art results in many NLP tasks. Examples include
language modeling, machine translation, and speech recognition. DL models can
automatically learn hierarchical features from raw text data, reducing the need for manual
feature engineering.

Examples of Applications
Supervised Learning in NLP: Email spam classification, sentiment analysis, and part-of-
speech tagging.
Unsupervised Learning in NLP: Topic modeling, clustering, and word embeddings.
Deep Learning in NLP: Transformer models like BERT and GPT for tasks such as text
generation, translation, and summarization.

6) List and explain different approaches used to solve NLP pipeline?

Ans|

 Heuristics-Based NLP
 Machine Learning for NLP
 Deep Learning for NLP

1. Heuristics-Based NLP
Definition: This approach involves building rules for the task at hand. It requires domain
expertise to formulate rules that can be incorporated into a program. Resources such as
dictionaries and thesauruses are typically used.

Examples and Tools:

Lexicon-based Sentiment Analysis: Uses counts of positive and negative words in the text
to deduce sentiment.
Knowledge Bases: Tools like Wordnet, which is a database of words and semantic

relationships (synonyms, hyponyms, meronyms).


Common Sense Knowledge Bases: Examples include Open Mind Common Sense.
Regular Expressions (Regex): Used for text analysis and finding patterns, such as
identifying email IDs from text.
Context-Free Grammar (CFG): A type of formal grammar used to model natural
languages.
Frameworks and Tools: TokensRegex (StanfordCoreNLP), JAPE (Java Annotation
Patterns Engine), and GATE (General Architecture for Text Engineering).
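
As a small illustration of the heuristics-based approach, a regular expression can pull email IDs out of raw text. The pattern below is a simplified sketch, not a fully general email matcher:

import re

text = "Contact us at support@example.com or sales@example.org for details."
# Simplified pattern: name characters, '@', then a dotted domain
emails = re.findall(r'[\w.-]+@[\w.-]+\.\w+', text)
print(emails)  # ['support@example.com', 'sales@example.org']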

2. Machine Learning for NLP


Definition: Machine learning techniques are applied to textual data, involving supervised
learning (classification and regression) and unsupervised learning (clustering).

Steps:

Feature Extraction: Extracting features from text.


Model Learning: Using the feature representation to learn a model.
Model Evaluation and Improvement: Evaluating and refining the model.

Popular Algorithms:
Naive Bayes: A classification algorithm based on Bayes' theorem, assuming feature
independence.
Support Vector Machine (SVM): Learns decision boundaries (linear or nonlinear) to
separate data points of different classes.
Hidden Markov Model (HMM): Assumes an underlying hidden process generating the
data; used for tasks like part-of-speech (POS) tagging.
Conditional Random Fields (CRF): Used for sequential data, performs classification on
each element in a sequence, often outperforming HMMs for tasks like POS tagging.

3. Deep Learning for NLP


Definition: Uses neural networks to handle complex, unstructured data. Deep learning
models have superior representation and learning capabilities for language tasks.

Popular Architectures:
Recurrent Neural Networks (RNNs): Suitable for sequential data, remembers previously
processed information, used in tasks like text classification and machine translation.
Long Short-Term Memory Networks (LSTMs): A type of RNN that can remember longer
contexts by selectively forgetting irrelevant information.
Convolutional Neural Networks (CNNs): Common in computer vision, also used for text
classification by treating text as a matrix of word vectors.
Transformers: Utilize self-attention mechanisms to model textual context, with models like

BERT (Bidirectional Encoder Representations from Transformers) achieving state-of-the-art


performance.
Autoencoders: Learn compressed vector representations of input data, useful for creating
feature representations for downstream tasks.

7) With a block diagram explain briefly the key stages of NLP pipeline?

Ans|

Data Acquisition: Obtaining relevant data for the NLP task at hand.
Text Cleaning: Processing and cleaning the acquired text data to remove noise,
inconsistencies, and irrelevant information.
Pre-processing: Converting the cleaned text data into a standardized format suitable for
further analysis.
Feature Engineering: Extracting meaningful features from the pre-processed text data to
represent it in a format understandable by modeling algorithms.
Modeling: Building one or more models using the engineered features to solve the NLP task.
Evaluation: Assessing the performance of the models using relevant evaluation metrics and
comparing them to select the best-performing model.
Deployment: Integrating the chosen model into production systems for real-world use.
Monitoring & Model Updating: Regularly monitoring the deployed model's performance
and updating it as needed to maintain optimal performance over time.

8) Define data acquisition. List and explain various sources for data acquisition.

Ans|

Definition of Data Acquisition:


Data acquisition refers to the process of collecting and obtaining relevant data for a particular
task or project. In the context of machine learning and natural language processing (NLP),
data acquisition is crucial as it forms the foundation for training and building models.

Various Sources for Data Acquisition:



a. Public Datasets:
Public datasets are collections of data that are openly available for use by anyone. These
datasets are often curated and maintained by organizations, research institutions, or
government agencies.
Researchers and developers can leverage public datasets for various NLP tasks if the data is
relevant to their project. Platforms like Kaggle, UCI Machine Learning Repository, and
Google's Dataset Search provide access to a wide range of public datasets.

b. Web Scraping:
Web scraping involves extracting data from websites. Developers can scrape relevant text
data from online sources such as forums, social media platforms, or discussion boards where
users post queries or comments related to the NLP task at hand.
Once the data is scraped, it may need to be cleaned and labeled before it can be used for
training NLP models.

c. Product Intervention:
In industrial settings, AI models are often integrated into products or services. Product teams
can collaborate with AI teams to collect data directly from the usage of the product.
This approach, known as product intervention, involves instrumenting the product to collect
user interactions and feedback, which can then be used as training data for NLP models.

d. Data Augmentation:
Data augmentation techniques involve generating synthetic data from existing datasets to
increase the diversity and size of the training data.
Methods such as synonym replacement, back translation, TF-IDF-based word replacement,
bigram flipping, replacing entities, and adding noise to data can be used to create augmented
data for NLP tasks.

e. Advanced Techniques:
Advanced techniques such as Snorkel, Easy Data Augmentation (EDA), NLPAug, and active
learning provide additional methods for data acquisition and augmentation.
These techniques automate the process of generating training data or facilitate the acquisition
of labeled data in scenarios where manual labeling is expensive or time-consuming.

9) Define text cleaning with proper code explain the steps applied in text cleaning.

Ans|

Definition of Text Cleaning:


Text cleaning refers to the process of removing unwanted or irrelevant information from raw
text data to prepare it for further analysis or processing. This involves tasks such as removing

markup, metadata, special characters, spelling correction, and normalization to ensure that
the text is in a clean and standardized format.

Steps Applied in Text Cleaning:

Step 1: Lowercase / Uppercase

Converting all text to a single case helps maintain consistency during NLP tasks and text mining.
The lower() and upper() string methods make this straightforward.

Ex:

text = "This is a Demo Text for NLP using NLTK."

lower_text = text.lower() / upper_text = text.upper()

print (lower_text) // this is demo text for nlp using nltk.

print( upper_text) // THIS IS A DEMO TEXT FOR NLP USING NLTK.

Step 2: Punctuation Removal

In this step we remove all punctuation, because punctuation adds noise to the sentence and
introduces ambiguity while training the model.

Ex:

import re

import nltk

from nltk.tokenize import word_tokenize

text = "This is a sample sentence, showing off the stop words filtration.“

tokens = word_tokenize(text)

# Regular expression to match punctuation

cleaned_tokens = [re.sub(r'[^\w\s]', '', token) for token in tokens
                  if re.sub(r'[^\w\s]', '', token)]

print(cleaned_tokens)

['This', 'is', 'a', 'sample', 'sentence', 'showing', 'off', 'the', 'stop', 'words', 'filtration']

Step 3: HTML Parsing and Cleanup:


Extracting text from HTML documents involves removing HTML tags and retaining only the

textual content. Libraries like Beautiful Soup and Scrapy can be used to parse HTML and
extract desired text elements.
Below is an example code snippet using Beautiful Soup to extract a question and its best
answer from a Stack Overflow web page:

Code:

from bs4 import BeautifulSoup

from urllib.request import urlopen

# Define URL

myurl = "https://stackoverflow.com/questions/415511/how-to-get-the-current-time-in-python"

# Read HTML content from URL

html = urlopen(myurl).read()

# Parse HTML using Beautiful Soup

soupified = BeautifulSoup(html, "html.parser")

# Extract question

question = soupified.find("div", {"class": "question"})

questiontext = question.find("div", {"class": "post-text"})

print("Question: \n", questiontext.get_text().strip())

# Extract best answer

answer = soupified.find("div", {"class": "answer"})

answertext = answer.find("div", {"class": "post-text"})

print("Best answer: \n", answertext.get_text().strip())

step 4: Spelling Correction:


Spelling correction involves identifying and correcting misspelled words in the text.
Libraries like pyenchant can be used for spell checking.
Example:

import enchant

# Initialize spell checker



spellchecker = enchant.Dict("en_US")

# Example text with misspellings

example_text = "Hollo, wrld"

# Check and correct spelling

corrected_text = " ".join([spellchecker.suggest(word)[0] if not spellchecker.check(word)
                           else word for word in example_text.split()])

print(corrected_text)

10) Define preprocessing with proper code. Explain the steps involved in preprocessing.

Ans|

Preprocessing in NLP
Preprocessing prepares raw text data for analysis and model building, ensuring it is cleaned,
normalized, and structured. Key steps include sentence segmentation, word tokenization, stop
word removal, lowercasing, removing punctuation and digits, stemming, and lemmatization.

Steps Involved in Preprocessing


Sentence Segmentation: Splitting text into sentences.
step 1: Word Tokenization: Splitting sentences into words.
Step 2: Stop Word Removal: Removing common, insignificant words.

Step 3: Normalization: Normalization is an advanced cleaning step used to maintain
uniformity. It brings all the words under one roof by applying stemming and lemmatization.

Step 4: Stemming: Reducing words to their base or root form.


Step 5: Lemmatization: Converting words to their base or dictionary form.
Here's a Python example using the NLTK library:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation

def preprocess_corpus(texts):
    mystopwords = set(stopwords.words("english"))

    def remove_stops_digits(tokens):
        return [token.lower() for token in tokens
                if token not in mystopwords
                and not token.isdigit()
                and token not in punctuation]

    return [remove_stops_digits(word_tokenize(text)) for text in texts]

Explanation
Sentence Segmentation: sent_tokenize(mytext) splits the text into sentences.
Word Tokenization: word_tokenize(sentence) splits sentences into words.
Stop Word Removal, Lowercasing, Removing Digits/Punctuation:
remove_stops_digits(tokens) removes stop words, digits, punctuation, and converts to
lowercase.
Stemming: stem_tokens(tokens) applies the PorterStemmer.
Lemmatization: lemmatize_tokens(tokens) converts words to their dictionary form.
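
A short usage sketch of the function above (the sample sentences are illustrative; the exact output depends on NLTK's English stop word list):

texts = ["Cats are running faster than 2 dogs!", "NLP is fun."]
print(preprocess_corpus(texts))
# e.g. [['cats', 'running', 'faster', 'dogs'], ['nlp', 'fun']]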

11) Define stemming and lemmatization. With proper code explain the working of
stemming and lemmatization.

Ans|

Stemming and Lemmatization


Stemming and lemmatization are both text normalization techniques in Natural Language
Processing (NLP) used to reduce words to their base or root form. However, they achieve
this through different methods and with varying levels of linguistic accuracy.

Stemming
Stemming is a text preprocessing technique in NLP that reduces inflected forms of a word to a
single base form, called the stem or root form.

Example of Stemming using NLTK:

from nltk.stem.porter import PorterStemmer

# Initialize the Porter Stemmer

stemmer = PorterStemmer()

# Sample words

word1, word2 = "cars", "revolution"

# Applying stemming

print(stemmer.stem(word1)) # Output: car

print(stemmer.stem(word2)) # Output: revolut

In this example, "cars" is stemmed to "car," and "revolution" is stemmed to "revolut," where
the latter is not a correct linguistic form.

Lemmatization
Lemmatization reduces words to their base or dictionary form, known as a lemma,
considering the context and part of speech (POS) of the word. It requires more linguistic
knowledge compared to stemming and usually results in a valid word.

Example of Lemmatization using NLTK:

from nltk.stem import WordNetLemmatizer

# Initialize the WordNet Lemmatizer

lemmatizer = WordNetLemmatizer()

# Applying lemmatization

print(lemmatizer.lemmatize("better", pos="a")) # Output: good

Here, "better" is lemmatized to "good" when specifying it as an adjective (POS = "a").

Example of Lemmatization using spaCy:

import spacy

# Load the spaCy model



sp = spacy.load('en_core_web_sm')

# Sample word

token = sp(u'better')

# Applying lemmatization

for word in token:
    print(word.text, word.lemma_)  # Output: better well

In this case, spaCy lemmatizes "better" to "well." Both "good" (from NLTK) and "well" are
contextually correct.

Summary
Stemming: Uses heuristic rules to strip suffixes (e.g., "cars" to "car," "revolution" to
"revolut").
Lemmatization: Uses linguistic knowledge to map words to their base form (e.g., "better" to
"good").
Stemming is faster but less accurate, whereas lemmatization is more accurate but slower and
requires part-of-speech tagging to determine the correct base form of a word.

12) Define and use phrase structure grammar rules to draw parse trees for the sentences
below.
I. I saw the man on the hill with the telescope
II. The house at the end of the street is red

Ans|
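
The parse trees themselves are easiest to draw by hand, but their structure can be derived from phrase structure (CFG) rules. Below is a minimal sketch using NLTK with a simplified, assumed grammar for sentence I; the parser returns multiple trees because the sentence is ambiguous (the prepositional phrases "on the hill" and "with the telescope" can attach to either the noun phrase or the verb phrase). Sentence II can be handled the same way by adding rules and words such as 'house', 'street', 'at', 'of', 'is', and 'red'.

import nltk

# Simplified phrase structure grammar (illustrative, not exhaustive)
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Pro | Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Pro -> 'I'
Det -> 'the'
N -> 'man' | 'hill' | 'telescope'
V -> 'saw'
P -> 'on' | 'with'
""")

parser = nltk.ChartParser(grammar)
sentence = "I saw the man on the hill with the telescope".split()
for tree in parser.parse(sentence):
    tree.pretty_print()  # prints each possible parse tree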

2nd Assignment:

1. Explain in detail the techniques of Vector Space Model (VSM) and List the Pros and Cons.

• VSM is used to represent text as vectors of numbers.

• Helps convert raw text into numerical form for ML and NLP models.

• Each word or document is mapped to a point in a multi-dimensional space.

• The process is known as text representation or feature engineering.

• Common units include: words, phrases, sentences, or full documents.

• The dimensions of the vector space are defined by a vocabulary (list of unique terms).

• Words are often represented using Bag-of-Words (BoW) or TF-IDF.

• BoW counts how many times a word appears in a document.

• TF-IDF (Term Frequency-Inverse Document Frequency) weighs words by importance.

• VSM allows us to perform text similarity and search operations.

• Cosine similarity is used to measure how similar two vectors are.

• Formula: Cos(θ) = (A · B) / (||A|| × ||B||) where A and B are text vectors.

• Sometimes Euclidean distance is used to find dissimilarity.

• VSM is the basis for classification, clustering, and retrieval tasks.

• It can be applied to any natural language text.

• It forms the foundation for more advanced models like Word2Vec and BERT.
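
A minimal sketch of the cosine similarity computation over two document vectors (the vectors shown are assumed BoW counts over the vocabulary [dog, bites, man, eats, meat, food]):

import numpy as np

A = np.array([1, 1, 1, 0, 0, 0])   # "dog bites man"
B = np.array([0, 0, 1, 1, 0, 1])   # "man eats food"

cosine = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(round(cosine, 3))  # 0.333 -> the documents share only the word "man"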

Pros of VSM:

• Simple, intuitive, and easy to implement.

• Compatible with most machine learning algorithms.

• Provides a mathematical structure to work with text data.

• Works well for basic search and ranking operations.

• Serves as a starting point for more complex text models.


Cons of VSM:

• Treats words as independent units, missing semantic meaning.

• Produces sparse and high-dimensional vectors, which are inefficient.

• Cannot handle word relationships or context.

• Struggles with Out-of-Vocabulary (OOV) words.

• Performance degrades with long or complex texts.

• Not suitable for dynamic or real-time language use.

• Lacks ability to generate fixed-length vectors for varying text sizes.

2. Explain in detail the techniques of Bag of Words (BoW) and List the Pros and Cons.

• BoW is a basic and widely used text representation method in NLP.

• It assumes that word frequency is more important than word order.

• Represents a document as a vector of word counts.

• Builds a vocabulary from all unique words in the dataset.

• Each word is assigned a unique index or ID in the vocabulary.

• The document is converted into a |V|-dimensional vector, where V = size of vocabulary.

• Each vector component stores the frequency of a word in the document.

• Word order is ignored, only frequency matters.

• BoW is similar to one-hot encoding but with word counts instead of binary flags.

• Helps in text classification, spam detection, sentiment analysis, etc.

• Works well when combined with ML models like Naive Bayes or SVM.

• Simple to implement using CountVectorizer in Python (e.g., scikit-learn).

• Example:

o Documents: ["dog bites man", "man eats food"]

o Vocabulary: [bites, dog, eats, food, man]

o Vectors: [1,1,0,0,1] and [0,0,1,1,1]

• Commonly used as a baseline before applying more complex models.


• Often combined with TF-IDF to reduce the effect of common words.

• Performs well on small to medium-sized datasets.
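
A minimal CountVectorizer sketch reproducing the example above (scikit-learn sorts the vocabulary alphabetically, giving the same column order):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["dog bites man", "man eats food"]
vect = CountVectorizer()
bow = vect.fit_transform(docs)

print(vect.get_feature_names_out())  # ['bites' 'dog' 'eats' 'food' 'man']
print(bow.toarray())                 # [[1 1 0 0 1]
                                     #  [0 0 1 1 1]]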

Pros of BoW:

• Very simple, fast, and easy to implement.

• Converts text into a numerical format usable by ML models.

• Works well for basic NLP tasks and classification problems.

• No complex preprocessing or deep linguistic knowledge required.

• Performs decently with well-defined domains (e.g., product reviews).

Cons of BoW:

• Ignores grammar and word order completely.

• Cannot capture semantics or meaning of the words.

• Results in sparse and high-dimensional vectors, increasing memory use.

• Suffers from the curse of dimensionality as vocabulary grows.

• Cannot handle Out-of-Vocabulary (OOV) words in test data.

• Does not consider context (e.g., "bank of a river" vs "money in the bank").

• Fails to understand synonyms or related words

3. Explain in detail the techniques of Bag of n-grams (BoN) and List the Pros and Cons.

• BoN is a text representation technique used in NLP.

• It improves on Bag of Words by including word order information.

• Breaks down text into contiguous sequences of 'n' words, called n-grams.

• Examples:

o For n=1 → unigrams: "dog", "bites", "man"

o For n=2 → bigrams: "dog bites", "bites man"

o For n=3 → trigrams: "dog bites man"

• The vocabulary is built using all unique n-grams in the dataset.

• Each document is represented as a |V|-dimensional vector, where V = number of unique n-grams.


• The value at index i in the vector represents the frequency of that n-gram in the document.

• If an n-gram is missing in a document, the value is zero at that index.

• Helps in retaining local context, which BoW completely ignores.

• Allows the model to distinguish between similar sentences with different meanings.

• Known as n-gram feature extraction or contextual bag-of-words.

• It is especially useful in language modeling, text classification, and authorship detection.

• BoW is a special case of BoN when n = 1.

• Can be implemented easily using tools like CountVectorizer with ngram_range in Python.

• Common practice: use bigrams or trigrams for balance between context and complexity.

• Best used with smaller 'n' values to avoid vector explosion.
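
A minimal BoN sketch using CountVectorizer with ngram_range (unigrams plus bigrams here, chosen only for illustration):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["dog bites man", "man eats food"]
vect = CountVectorizer(ngram_range=(1, 2))   # keep unigrams and bigrams
bon = vect.fit_transform(docs)

print(vect.get_feature_names_out())
# ['bites' 'bites man' 'dog' 'dog bites' 'eats' 'eats food' 'food' 'man' 'man eats']
print(bon.toarray())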

Pros of Bag of N-Grams (BoN):

• Captures word order and context (unlike BoW).

• Improves performance in tasks like sentiment analysis and spam detection.

• Enhances semantic similarity between documents.

• Helps distinguish between phrases like "not good" and "very good".

• Simple and compatible with machine learning models.

Cons of Bag of N-Grams (BoN):

• High dimensionality as 'n' increases (combinatorial explosion).

• Leads to sparse vectors, increasing memory and processing time.

• Still suffers from Out-of-Vocabulary (OOV) problems.

• Does not account for long-range dependencies or deep semantics.

• Requires large training data to build reliable n-gram vocabularies.

4. Explain in detail the techniques of TF-IDF and List the Pros and Cons.

• TF-IDF is a numerical text representation used in NLP.

• It measures how important a word is in a document relative to the whole corpus.

• Improves on Bag of Words by reducing the weight of common, less informative words.
• TF-IDF combines two components: Term Frequency (TF) and Inverse Document Frequency (IDF).

• Term Frequency (TF) = (Number of times term t appears in document d) ÷ (Total terms in document
d).

• Inverse Document Frequency (IDF) = logₑ(Total number of documents ÷ Number of documents
containing term t).

• Final TF-IDF score = TF × IDF.

• High TF-IDF means the word is frequent in a specific document but rare across other documents.

• Common words like "the", "is", "and" get low IDF scores, reducing their overall impact.

• Creates a vector representation for each document using TF-IDF values of its terms.

• These vectors are used in tasks like document similarity, classification, and clustering.

• Popular tools like scikit-learn’s TfidfVectorizer can be used for implementation.

• Example:

o Word "machine" appears 4 times in a 100-word doc (TF = 0.04).

o If it appears in 2 out of 10 documents, IDF = log(10/2) = 1.609.

o TF-IDF = 0.04 × 1.609 ≈ 0.064.

• TF-IDF does not preserve word order, but it improves on raw frequency.

• Helps highlight keywords and distinguishing terms in a document.

• Often used as a baseline in NLP pipelines before moving to neural models.

• Works best when documents are short to medium-length.
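
A minimal TfidfVectorizer sketch (note that scikit-learn uses a smoothed natural-log IDF and L2-normalizes each row, so the exact weights differ slightly from the hand formula above):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
vect = TfidfVectorizer()
tfidf = vect.fit_transform(docs)

print(vect.get_feature_names_out())  # ['bites' 'dog' 'eats' 'food' 'man' 'meat']
print(tfidf.toarray().round(2))      # one weighted vector per document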

Pros of TF-IDF:

• Captures word importance better than BoW.

• Reduces influence of very common (stop) words.

• Simple to compute and understand.

• Useful in many text mining and information retrieval tasks.

• Compatible with similarity metrics like cosine similarity.

Cons of TF-IDF:

• Produces high-dimensional and sparse vectors.


• Cannot handle Out-of-Vocabulary (OOV) words.

• Does not consider word order or semantics.

• Not ideal for handling synonyms or polysemy (multiple meanings).

• Fails to capture contextual meaning of words in different settings.

5. Explain Word Embeddings and briefly describe the 2 variants of Word Embedding.

• Word embeddings are dense, low-dimensional vector representations of words.

• They capture the meaning of words based on their context in large text corpora.

• Built on the distributional hypothesis: Words used in similar contexts have similar meanings.

• Unlike sparse methods (BoW, TF-IDF), embeddings are compact and efficient.

• Example: "king" – "man" + "woman" ≈ "queen".

• Embeddings place similar words close together in vector space.

• Enable downstream NLP tasks like sentiment analysis, translation, and question answering.

• Word embeddings are learned from large datasets using neural networks.

• Common pre-trained models: Word2Vec, GloVe, fastText.

• Greatly reduce dimensionality while capturing semantic and syntactic relationships.

• Help handle semantic similarity, analogies, and contextual clues.

• Can be visualized using tools like t-SNE or PCA to show clustering of similar words.

1. Continuous Bag of Words (CBOW):

• Goal: Predict the center word using surrounding context words.

• Takes multiple context words (e.g., before and after) as input.

• Produces the center word as output.

• Works well with frequent words and is faster to train.

• Uses averaged context vectors to predict the target.

• Based on a shallow neural network with one hidden layer.



2. SkipGram:

• Goal: Predict surrounding context words from a center word.

• Takes the center word as input, predicts multiple context words as output.

• Generates multiple (center, context) training pairs for each window.

• Better at capturing rare word representations.

• Slightly slower than CBOW but more effective for small datasets.
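
Both variants are available in gensim's Word2Vec via the sg flag; the toy corpus and hyperparameters below are illustrative assumptions (real embeddings need far larger corpora):

from gensim.models import Word2Vec

sentences = [["dog", "bites", "man"], ["man", "bites", "dog"],
             ["dog", "eats", "meat"], ["man", "eats", "food"]]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)      # CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # SkipGram

print(cbow.wv["dog"][:5])               # first few dimensions of the "dog" vector
print(skipgram.wv.most_similar("dog"))  # nearest neighbours in the embedding space
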
6. What are the key differences between CBOW and SkipGram models in the context of word
embeddings?

Aspect | CBOW (Continuous Bag of Words) | SkipGram
Objective | Predict the center word from surrounding context words | Predict the context words from a center word
Input | Multiple context words | One center word
Output | One center (target) word | Multiple context words
Training Examples per Window | 1 example per window | 2k examples per window (where k = context size)
Data Efficiency | More efficient with frequent words | Better at learning rare word representations
Computation Time | Generally faster | Slower due to more training samples
Embedding Update Style | Averages or sums context word vectors | Uses center word vector for multiple predictions
Output Layer Complexity | Single prediction (center word) | Multiple predictions (context words)
Use Case Preference | Suitable for large corpora with frequent words | Suitable for smaller corpora or rare words
Architecture | Shallow neural network with averaged context input | Shallow neural network with center word as input
Training Stability | More stable and faster convergence | Slower convergence, but can capture more subtle meanings
Semantic Relationships | Good, but less detailed | Captures richer semantic relationships
Example Input/Output | Input: ["the", "brown", "fox"] → Output: "quick" | Input: "quick" → Output: ["the", "brown", "fox"]

7. Consider the Training Corpus with 4 sentences or documents Apply VSM, BoW, TF-IDF modeling
techniques to find the word to vector text representation.
D1 – Dog bites man
D2 – Man bites dog
D3 – Dog eats meat
D4 – Man eats food
Corpus:

Document Text

D1 Dog bites man

D2 Man bites dog

D3 Dog eats meat

D4 Man eats food

Vocabulary (with fixed order):

[dog, bites, man, eats, meat, food]

Vector Space Model (VSM)

VSM is a general framework where documents are represented as vectors in a high-dimensional space.
BoW and TF-IDF are specific implementations of VSM.
Similarity between documents can be calculated using cosine similarity or Euclidean distance.

VSM Representation for the Corpus

Document Vector Representation (using vocabulary: dog, bites, man, eats, meat, food)

D1 [1, 1, 1, 0, 0, 0]

D2 [1, 1, 1, 0, 0, 0]

D3 [1, 0, 0, 1, 1, 0]

D4 [0, 0, 1, 1, 0, 1]

This is the basic VSM using raw frequency counts, which is exactly the Bag of Words representation.

Bag of Words (BoW) Vectors

Document dog bites man eats meat food BoW Vector

D1 1 1 1 0 0 0 [1, 1, 1, 0, 0, 0]

D2 1 1 1 0 0 0 [1, 1, 1, 0, 0, 0]

D3 1 0 0 1 1 0 [1, 0, 0, 1, 1, 0]

D4 0 0 1 1 0 1 [0, 0, 1, 1, 0, 1]

TF-IDF Representation

Step 1: IDF for Each Term

IDF(term) = log₂(N / df), where N is the total number of documents and df is the number of documents containing the term.

Term Document Frequency (df) IDF (approx, base 2)

dog 3 0.415

bites 2 1.000

man 3 0.415

eats 2 1.000

meat 1 2.000

food 1 2.000

Step 2: TF-IDF Vectors

(Each word in a document appears once; document length = 3, so TF = 1/3 = 0.333)

Document dog bites man eats meat food TF-IDF Vector

D1 0.1383 0.3333 0.1383 0 0 0 [0.1383, 0.3333, 0.1383, 0, 0, 0]

D2 0.1383 0.3333 0.1383 0 0 0 [0.1383, 0.3333, 0.1383, 0, 0, 0]

D3 0.1383 0 0 0.3333 0.6667 0 [0.1383, 0, 0, 0.3333, 0.6667, 0]

D4 0 0 0.1383 0.3333 0 0.6667 [0, 0, 0.1383, 0.3333, 0, 0.6667]
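
The hand-computed table can be verified with a few lines of NumPy, using the same TF = count/3 and IDF = log₂(N/df) definitions as above:

import numpy as np

# Rows: D1-D4; columns: dog, bites, man, eats, meat, food
counts = np.array([[1, 1, 1, 0, 0, 0],    # D1: dog bites man
                   [1, 1, 1, 0, 0, 0],    # D2: man bites dog
                   [1, 0, 0, 1, 1, 0],    # D3: dog eats meat
                   [0, 0, 1, 1, 0, 1]])   # D4: man eats food

tf = counts / 3                  # every document has 3 words
df = (counts > 0).sum(axis=0)    # document frequency of each term
idf = np.log2(4 / df)            # N = 4 documents
print((tf * idf).round(4))       # matches the TF-IDF vectors in the table above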


8. Consider the given training corpus and apply the bigram modeling technique to find the probability of
each bigram formed.

I am a human

I am not a stone

I live in Ballari

1. Text Representation and Bigram Modeling

Text representation is the process of transforming raw text into a numerical form (numeric vectors) so that it
can be processed by machine learning algorithms. The Vector Space Model (VSM) is a foundational concept
where text units are represented as vectors of numbers.

The Bag of N-Grams (BoN) approach is a text representation scheme that addresses the limitation of earlier
methods (like Bag of Words) by capturing some context and word-order information.

2. Pre-processing the Corpus

• S1: [i, am, a, human]

• S2: [i, am, not, a, stone]

• S3: [i, live, in, ballari]

3. Forming Bigrams and Counting Frequencies

Bigrams from the Corpus:

• (i, am) [from S1, S2]

• (am, a) [from S1]

• (a, human) [from S1]

• (am, not) [from S2]

• (not, a) [from S2]

• (a, stone) [from S2]

• (i, live) [from S3]

• (live, in) [from S3]

• (in, ballari) [from S3]


Frequency Counts:

Bigram Count

(i, am) 2

(am, a) 1

(a, human) 1

(am, not) 1

(not, a) 1

(a, stone) 1

(i, live) 1

(live, in) 1

(in, ballari) 1

Unigram Frequency Counts (Count of individual words):

Unigram (Word) Count

i 3

am 2

a 2

human 1

not 1

stone 1

live 1

in 1

ballari 1
4. Calculating Bigram Probabilities

• P (am | i) = Count (i, am) / Count(i) = 2 / 3 ≈ 0.667

• P(a | am) = Count(am, a) / Count(am) = 1 / 2 = 0.5

• P(human | a) = Count(a, human) / Count(a) = 1 / 2 = 0.5

• P(not | am) = Count(am, not) / Count(am) = 1 / 2 = 0.5

• P(a | not) = Count(not, a) / Count(not) = 1 / 1 = 1.0

• P(stone | a) = Count(a, stone) / Count(a) = 1 / 2 = 0.5

• P(live | i) = Count(i, live) / Count(i) = 1 / 3 ≈ 0.333

• P(in | live) = Count(live, in) / Count(live) = 1 / 1 = 1.0

• P(ballari | in) = Count(in, ballari) / Count(in) = 1 / 1 = 1.0
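
The same counts and probabilities can also be computed programmatically; a minimal sketch using collections.Counter:

from collections import Counter

corpus = [["i", "am", "a", "human"],
          ["i", "am", "not", "a", "stone"],
          ["i", "live", "in", "ballari"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))

for (w1, w2), c in bigrams.items():
    print(f"P({w2} | {w1}) = {c}/{unigrams[w1]} = {c / unigrams[w1]:.3f}")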

9. What is Text Classification? Steps to build Text Classification and its applications.

• Text classification is an NLP task of assigning categories to textual data (like sentences, documents,
reviews).

• It can be:

o Binary (e.g., spam vs. non-spam),

o Multiclass (e.g., positive, neutral, negative sentiment),

o Multilabel (text can belong to multiple classes).

Steps to Build a Text Classification System



1. Training Data Collection

• Collect labeled text data (e.g., emails with spam/not spam labels).

• Use public datasets or apply techniques like:

o Data augmentation,

o Weak supervision (e.g., Snorkel),

o Active learning.

2–3. Preprocessing and Feature Extraction

• Clean and prepare the text using:

o Tokenization, lowercasing, stop word removal,

o Stemming/lemmatization,

o Text normalization and language detection.

• Convert text to numbers using:

o BoW, TF-IDF, Word2Vec, BERT, etc.

4–5. Train and Evaluate Classifier

• Algorithms used:

o Traditional ML: Naive Bayes, SVM, Logistic Regression,

o Deep Learning: CNNs, RNNs, LSTMs, Transformers (BERT).

• Evaluate using:

o Accuracy, Precision, Recall, F1-score, AUC, Confusion Matrix.



6. Deploy and Predict on New Texts

• Deploy trained model as a service.

• Use it to predict categories for new, unseen texts.

• Monitor and update regularly for real-world performance.

Applications of Text Classification

• Spam detection in emails.

• Sentiment analysis of product reviews.

• News and content categorization.

• Language detection (e.g., in Google Translate).

• Fake news detection.

• E-commerce product classification.

10. With code snippet explain the classification modeling using Naive Bayes classifier.

• Naive Bayes classifier applies Bayes’ theorem assuming feature independence, predicting the class
with highest posterior probability.

• Commonly used in text classification as a baseline model due to simplicity and efficiency.

• Pipeline steps:

1. Train-test split: Split data into training and test sets.

2. Preprocessing & Vectorization: Clean text (lowercase, remove punctuation/digits), convert to
numeric vectors using CountVectorizer (BoW).

3. Train classifier: Fit MultinomialNB on training vectors.

4. Predict and evaluate: Predict on test vectors and evaluate accuracy.

Code snippet

from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer


from sklearn.naive_bayes import MultinomialNB

from sklearn import metrics

import re

def clean(text):
    text = text.lower()
    text = re.sub(r'[\d\W]+', ' ', text)
    return text

# Example dataset

X = ["I love this product", "This is an amazing book", "I hate this movie", "This movie is terrible"]

y = ['positive', 'positive', 'negative', 'negative']

# Split data

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Vectorize

vect = CountVectorizer(preprocessor=clean, stop_words='english')

X_train_dtm = vect.fit_transform(X_train)

X_test_dtm = vect.transform(X_test)

# Train Naive Bayes

nb = MultinomialNB()

nb.fit(X_train_dtm, y_train)

# Predict & Evaluate

y_pred = nb.predict(X_test_dtm)

print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

11. With code snippet explain the classification modeling using Logistic Regression.

• Logistic Regression is a discriminative classifier that models the probability distribution over classes.

• It learns weights for features, aiming to find a linear decision boundary to separate classes.

• Uses the logistic (sigmoid) function to estimate probabilities.


• Often used as a baseline and real-world MVP in text classification tasks.

• Can handle imbalanced datasets using the class_weight="balanced" parameter.

Classification Modeling Steps:

1. Train Logistic Regression model on the training feature vectors.

2. Predict classes on the test data.

3. Evaluate performance using metrics like accuracy.

Code Snippet

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

# Assuming X_train_dtm, X_test_dtm, y_train, y_test are already defined

# Instantiate Logistic Regression with balanced class weights

logreg = LogisticRegression(class_weight="balanced", max_iter=1000)

# Train the model

logreg.fit(X_train_dtm, y_train)

# Predict classes on test data

y_pred_class = logreg.predict(X_test_dtm)

# Evaluate accuracy

print("Accuracy:", accuracy_score(y_test, y_pred_class))

12. With code snippet explain the classification modeling using SVM.

• SVM is a discriminative classifier like Logistic Regression.

• It finds an optimal hyperplane that maximizes the margin between classes.

• Can handle non-linear boundaries using kernel tricks (though here we use a linear SVM).

• Typically takes longer to train than simpler models.

• Useful for high-dimensional sparse data like text.

• class_weight='balanced' can help with imbalanced classes by weighting inversely proportional to
class frequencies.

Classification Modeling Steps (with feature extraction and SVM training):

1. Use CountVectorizer to convert text into a document-term matrix with a limited number of features
(max_features=1000 to reduce complexity).

2. Train a Linear SVM classifier on the training data.

3. Predict classes on the test data.

4. Calculate accuracy.

Code Snippet

from sklearn.svm import LinearSVC

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.metrics import accuracy_score

# Assuming X_train, X_test, y_train, y_test are already defined

# 'clean' is a pre-processing function for text (e.g., lowercasing, removing punctuation)

# Step 1: Vectorize text with max 1000 features to limit training time

vect = CountVectorizer(preprocessor=clean, max_features=1000)

X_train_dtm = vect.fit_transform(X_train)

X_test_dtm = vect.transform(X_test)

# Step 2: Initialize Linear SVM with balanced class weights

classifier = LinearSVC(class_weight='balanced', max_iter=10000)

# Step 3: Train the classifier

classifier.fit(X_train_dtm, y_train)

# Step 4: Predict test set labels

y_pred_class = classifier.predict(X_test_dtm)

# Step 5: Evaluate accuracy

print("Accuracy:", accuracy_score(y_test, y_pred_class))


13. With code snippet explain the classification modeling using CNN.

• CNNs learn useful local features (like important n-grams) automatically.

• Typically, 1D convolution layers followed by pooling layers are stacked.

• Output layer size = number of classes; uses softmax activation for multi-class classification.

• Embedding layer maps words to dense vectors (can be pre-trained or trained from scratch).

• Use categorical crossentropy loss and an optimizer like RMSprop or Adam.

• Train with multiple epochs and evaluate on test data.
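
Since the question asks for a code snippet, here is a minimal Keras sketch of the architecture described above; the vocabulary size, sequence length, and other hyperparameters are assumed values, and the (commented-out) training call expects a tokenized, padded dataset:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Dense

VOCAB_SIZE, MAX_LEN, EMBED_DIM, NUM_CLASSES = 10000, 100, 128, 3  # assumed values

model = Sequential([
    Input(shape=(MAX_LEN,)),                      # padded sequences of word indices
    Embedding(VOCAB_SIZE, EMBED_DIM),             # word index -> dense vector
    Conv1D(128, 5, activation='relu'),            # learns local n-gram features
    GlobalMaxPooling1D(),                         # keeps the strongest feature per filter
    Dense(128, activation='relu'),
    Dense(NUM_CLASSES, activation='softmax')      # one output per class
])

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
# model.fit(X_train_padded, y_train_onehot, epochs=5, validation_split=0.1)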

14. With code snippet explain the classification modeling using LSTM.

• LSTMs are a type of Recurrent Neural Network (RNN) specialized for sequential data.
• They capture long-range dependencies in text, remembering context from earlier words.
• The model includes an embedding layer followed by an LSTM layer and dense output layer.
• Dropout and recurrent dropout help reduce overfitting.
• Use binary crossentropy loss for binary classification and Adam optimizer.
• LSTMs generally take longer to train and need more data than CNNs.
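
A corresponding minimal Keras LSTM sketch for binary classification (hyperparameters are again assumed values, and training expects a tokenized, padded dataset):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

VOCAB_SIZE, MAX_LEN, EMBED_DIM = 10000, 100, 128  # assumed values

model = Sequential([
    Input(shape=(MAX_LEN,)),
    Embedding(VOCAB_SIZE, EMBED_DIM),
    LSTM(64, dropout=0.2, recurrent_dropout=0.2),   # captures long-range context in the sequence
    Dense(1, activation='sigmoid')                   # single probability for the positive class
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
# model.fit(X_train_padded, y_train, epochs=5, validation_split=0.1)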

15. Explain the case study on Corporate Ticketing using block diagram.

The main aim of this system is to automatically classify customer support tickets related to medical issues and
route them to the appropriate team starting with no labeled training data.

Phase 1: Initial Data Collection and Model Building

1. No Labeled Data Available

o The company begins without any labeled examples of medical-related tickets.


2. Create a Baseline Dataset Using:

o Public APIs (e.g., Google Cloud NLP, AWS Comprehend):

▪ Pre-trained services used to label and classify text into general or medical categories.

o Public Datasets (e.g., 20 Newsgroups – sci.med category):

▪ Leverage existing datasets to train a general model.

o Weak Supervision:

▪ Rule-based labelling (e.g., if text contains "fever", "nausea", label as medical).

▪ Tools: Snorkel, Crowdsourcing (e.g., Mechanical Turk, Figure Eight).

3. Build and Deploy Initial Model

o A basic classifier is trained on the weakly labeled data.

o Model is deployed to start making predictions in production.

Phase 2: Improved Model with Continuous Iteration

4. Monitor Performance & Collect Signals

o Explicit Feedback:

▪ Domain experts correct misrouted tickets (e.g., medical team rejects irrelevant cases).

o Implicit Feedback:

▪ Observe improvements in metrics like response time, resolution rate.

5. Active Learning Loop

o Model selects uncertain cases → sends them for human labeling.

o These new labels improve the dataset and retrain the model.

o Tool: Prodigy (for interactive annotation with model-in-the-loop).

6. Analyze & Iterate

o Continuous retraining with new feedback and data.

o Refine rules, improve data quality, and adjust model behavior.


Assignment-3:

1. With a clear diagram illustrating the General Pipeline of Information Extraction (IE), and Explain
each task depicted in the diagram?

• IE is a process used to extract structured information (like names, events, relations) from
unstructured text.
• It needs detailed NLP processing, more than what's required for simple text classification.
• The steps include breaking down the text, identifying key information, and understanding sentence
structure.
• IE uses evaluation metrics like precision, recall, and F1-score. The pipeline is flexible—not all tasks
are required every time.

Explanation of Each Task in the IE Pipeline Diagram:

1. Sentence Segmentation

• Breaks the raw text into individual sentences.


• Example: Turns a paragraph into separate sentences.
2. Word Tokenization

• Splits each sentence into individual words or tokens.


• Example: "Albert Einstein was a scientist" becomes ["Albert", "Einstein", "was", "a", "scientist"].

3. Part of Speech (POS) Tagging

• Assigns labels like noun, verb, adjective to each word.


• Helps in identifying phrases and named entities.

4. Named Entity Recognition (NER)

• Detects proper names like people, organizations, or locations.


• Example: "Albert Einstein" → Person, "NASA" → Organization.

5. Key Phrase Extraction

• Identifies important phrases from the text that summarize its meaning.
• Often uses POS tags to find nouns and noun phrases.

6. Syntactic Parsing

• Analyzes the grammar of sentences to find relationships between words.


• Helps understand the sentence structure.

7. Entity Disambiguation

• Ensures different mentions of an entity are correctly identified.


• Example: "Apple" (fruit) vs. "Apple" (company).

8. Coreference Resolution

• Finds when different words refer to the same thing.


• Example: "Einstein was brilliant. He developed the theory..." → "He" = "Einstein".

9. Relation Extraction / Event Extraction

• Detects how entities are related or what events occurred.


• Example: "Einstein developed the theory of relativity" → [Einstein, developed, theory].

2. Explain all the 7 IE tasks briefly.

1. Keyphrase Extraction (KPE)

• Goal: Identify important words/phrases that best summarize a document.

• Use: Auto-tagging, indexing, search optimization.


• Example: In product reviews, Amazon highlights key phrases like “battery life” or “camera quality”.

2. Named Entity Recognition (NER)

• Goal: Detect and classify named entities in text (e.g., Person, Organization, Location, Date, Money).

• Use: Preprocessing step for many IE tasks like summarization, QA, MT.

• Example: “Steve Jobs founded Apple” → Steve Jobs (Person), Apple (Organization).

3. Named Entity Disambiguation & Linking (NEL)

• Goal: Assign a unique identity to detected entities by linking them to knowledge bases.

• Use: Resolves ambiguity (e.g., “Apple” = fruit or company?).

• Example: “Jaguar” → Car brand vs. animal, linked to Wikipedia/DBpedia entry.

4. Relationship Extraction (RE)

• Goal: Identify relationships between entities in a sentence or document.

• Use: Knowledge graph construction, semantic search.

• Example: “Sundar Pichai is the CEO of Google” → (Sundar Pichai, CEO, Google)

5. Event Extraction

• Goal: Extract events (who did what, when, where).

• Use: News summarization, timeline generation.

• Example: “Elon Musk launched Starship on Monday” → Event: Launch; Actor: Elon Musk; Object:
Starship; Time: Monday

6. Temporal Information Extraction

• Goal: Extract and normalize date/time expressions.

• Use: Calendar applications, scheduling assistants.

• Example: “Meeting at 3 PM today” → Date normalized as 2025-06-27 15:00

7. Template Filling

• Goal: Extract structured data from semi-structured text by filling predefined slots.

• Use: Automated reporting (weather, finance, sports).

• Example: Weather report → [City: Delhi, Temp: 35°C, Condition: Sunny]


3. Write a short note on KPE.

1. KPE is a task in Information Extraction (IE) under Natural Language Processing (NLP).

2. Its purpose is to extract important words or phrases that summarize the main idea of a text.

3. It helps in quickly identifying the core content of documents.

4. Common applications include search engine indexing, document tagging, recommendation systems,
and text summarization.

5. KPE is a less complex IE task and needs minimal pre-processing.

6. KPE methods are mainly divided into supervised and unsupervised techniques.

7. Supervised methods use labeled datasets but require manual annotation, which is costly and time-
consuming.

8. Unsupervised methods are more practical as they work without labeled data and are domain-
independent.

9. Graph-based algorithms like TextRank and SGRank are popular unsupervised techniques.

10. These algorithms treat words/phrases as nodes and rank them based on frequency and connection
strength.

11. Tools like textacy (built on spaCy) and gensim provide implementations for KPE (a short sketch follows this list).

12. KPE faces challenges such as overlapping keyphrases and length sensitivity in long documents.

13. Post-processing and applying custom filters or heuristics can improve the final output quality.

14. It’s important to clean and structure the text properly before extraction to get accurate results.

15. In real-world projects, KPE often works best when combined with domain-specific rules.
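
A minimal sketch of unsupervised KPE with the TextRank implementation in textacy (points 9–11 above). It assumes textacy and spaCy's en_core_web_sm model are installed; the module path shown is from recent textacy versions and may differ in older ones, so treat it as illustrative.

# Hedged sketch: TextRank keyphrase extraction via textacy
import textacy
from textacy.extract import keyterms

text = ("Keyphrase extraction identifies the words and phrases that best "
        "summarize a document, and graph-based algorithms such as TextRank "
        "rank candidate phrases by their connection strength.")

doc = textacy.make_spacy_doc(text, lang="en_core_web_sm")

# textrank returns (phrase, score) pairs, highest-ranked first
for phrase, score in keyterms.textrank(doc, topn=5):
    print(f"{phrase:40s} {score:.3f}")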

4. What is NER and how is an NER system built?

• NER stands for Named Entity Recognition, which is the process of identifying names of people,
organizations, locations, etc., in a given text.

• A simple way to build an NER system is by using a gazetteer, which is a list of known names (e.g., clients,
cities, companies).

• Gazetteer-based NER works by checking if a word appears in the list – if it does, it's tagged as a named
entity.
• The limitation of this method is that it doesn’t handle new names, name variations (e.g., USA vs. United States), or context.

• A rule-based NER system uses patterns like word types and POS tags to detect entities. Example: If a word
tagged as a proper noun appears before "was born", it's likely a person.

• Libraries like Stanford’s RegexNER and spaCy’s EntityRuler help build rule-based NER systems.

• A more powerful method is training an ML model to detect named entities based on context and word
features.

• NER is a sequence labeling problem, where the label of one word depends on its surrounding words.

• Sequence classifiers, like Conditional Random Fields (CRF), are commonly used to train NER models.

• For training, we use labeled datasets like CoNLL-03, where each word is tagged with a label (like B-PER, I-LOC, etc.).

• The BIO format is used in labeling:

1. B = Beginning of entity
2. I = Inside entity
3. O = Outside or not an entity

• To train an NER model, we follow four steps (a minimal code sketch follows this answer):

1. Load the dataset


2. Extract features
3. Train the classifier
4. Evaluate it

• Useful features include: if the word starts with a capital letter, its POS tag, and the POS tags of nearby
words.

• CRF models trained with such features can achieve high accuracy, like an F1 score of 0.92 in experiments.

• In real-world use, NER systems combine ML models, gazetteers, and rules for better results, since new
entities and domain-specific terms appear frequently.
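
A hedged sketch of the four training steps above using the sklearn-crfsuite library. The toy sentence, BIO labels, and feature names are invented for illustration; a real system would load a labeled corpus such as CoNLL-03 instead.

# Sketch: CRF-based NER as sequence labeling with BIO tags
import sklearn_crfsuite

def token_features(sent, i):
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),   # capitalization cue, useful for names
        "prev.lower": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

# Step 1: load (here: hard-code) a tiny BIO-labeled dataset
train_sents  = [["Albert", "Einstein", "was", "born", "in", "Ulm"]]
train_labels = [["B-PER", "I-PER", "O", "O", "O", "B-LOC"]]

# Step 2: extract features per token
X_train = [[token_features(s, i) for i in range(len(s))] for s in train_sents]

# Step 3: train the CRF sequence classifier
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X_train, train_labels)

# Step 4: evaluate / predict (here just on the training sentence)
print(crf.predict(X_train))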

5. Write a short note on Named Entity Disambiguation and Linking.

1. Named Entity Disambiguation (NED) means assigning a unique real-world identity to an entity
mentioned in text.

2. It helps identify what “Apple” refers to – the fruit or Apple Inc.


3. Named Entity Linking (NEL) = NER (identify entities) + NED (link to real-world meaning).

4. NEL links entities to knowledge bases like Wikipedia or Google Knowledge Graph.

5. It is useful in applications like search engines, chatbots, and question answering systems.

6. NEL helps build large knowledge bases by connecting people, places, organizations, etc.

7. It is important for news tagging, content recommendation, and event tracking.

8. NEL uses context around a word to determine its correct meaning.

9. Example: “Lincoln” could mean a car, a person, or a city – NEL finds the correct one.

10. It usually needs coreference resolution (e.g., "Einstein", "the scientist" = same person).

11. Also needs syntactic parsing to identify subject-verb-object relationships.

12. NEL is often done using supervised ML models, evaluated with precision, recall, F1-score.

13. Neural network-based approaches are commonly used in modern NEL systems.

14. Companies often use cloud services like Azure or IBM Watson for NEL instead of building from
scratch.

15. Quality of earlier NLP steps (cleaning, parsing) directly affects NEL performance.
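
A deliberately simplified illustration of point 8 (using context to disambiguate): each candidate's description is scored against the mention's sentence context with TF-IDF cosine similarity. The candidate descriptions are made up, and real NEL systems rely on supervised or neural models linked to knowledge bases rather than this heuristic.

# Simplified context-similarity sketch for entity disambiguation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

context = "Lincoln delivered the Gettysburg Address during the Civil War."
candidates = {
    "Abraham Lincoln (person)": "16th president of the United States, Civil War, Gettysburg Address",
    "Lincoln (car brand)":      "luxury vehicle division of Ford Motor Company, automobiles",
    "Lincoln, Nebraska (city)": "capital city of the US state of Nebraska",
}

vectorizer = TfidfVectorizer().fit([context] + list(candidates.values()))
context_vec = vectorizer.transform([context])

# The candidate whose description best matches the context wins
for name, description in candidates.items():
    score = cosine_similarity(context_vec, vectorizer.transform([description]))[0, 0]
    print(f"{name:28s} similarity = {score:.3f}")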

6. What is Relation Extraction (RE)? Explain the RE approaches in brief.

What is Relation Extraction (RE)?

1. Relation Extraction (RE) is a task in Information Extraction (IE) that identifies and extracts relationships
between entities in a text.

2. It helps build knowledge bases by connecting people, organizations, events, and more using extracted
information.

3. RE is used in search engines, question answering systems, and financial/medical analysis by connecting
entities meaningfully.

4. For example, in the sentence "Luca Maestri is Apple’s finance chief", RE extracts the relation (Luca
Maestri, finance chief, Apple).

5. RE is more complex than Named Entity Recognition (NER) because it involves understanding the context
and connection between entities.

Approaches to Relation Extraction


1. Pattern-based Approach

2. Supervised Learning Approach

3. Semi-Supervised Learning (Bootstrapping)

4. Distant Supervision

5. Unsupervised Learning (Open IE)

1. Pattern-based Approach

• Uses manually created rules or templates (like regular expressions) to find specific relations.

• Example: If a sentence says "X, the CEO of Y", it extracts a relation like (X, CEO, Y).

• Simple but limited—doesn't cover all sentence types.
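
A minimal sketch of a single hand-written pattern for the "X, the CEO of Y" example above, using a regular expression. Production pattern-based systems combine many such rules, often defined over POS tags rather than raw text, so this is only illustrative.

# One regex rule for the "X, the CEO of Y" pattern
import re

pattern = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*), the CEO of (?P<org>[A-Z][A-Za-z]+)"
)

text = "Sundar Pichai, the CEO of Google, spoke at the conference."
for match in pattern.finditer(text):
    print((match.group("person"), "CEO", match.group("org")))
# -> ('Sundar Pichai', 'CEO', 'Google')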

2. Supervised Learning Approach

• Treats RE like a classification problem using labeled data.

• Step 1: Check if two entities are related (yes/no).

• Step 2: If yes, identify what type of relation (e.g., founder, employee).

• Uses machine learning or neural networks trained on features like context words and syntax.

• Accurate but needs a lot of labeled data.

3. Semi-Supervised Learning (Bootstrapping)

• Starts with a few seed patterns or examples, then learns new patterns from data.

• Example: Start with one known pattern for "CEO of", then discover more similar phrases.

• Helpful when data is limited; expands patterns gradually.

4. Distant Supervision

• Uses existing knowledge bases like Wikipedia or Freebase to automatically label data.

• For example, if a database says "Elon Musk is the CEO of Tesla", and a sentence has both entities, it marks
that sentence as an example.

• Saves time by creating large datasets without manual labeling.

5. Unsupervised Learning (Open IE)

• Does not rely on predefined relations or training data.

• Extracts general tuples from sentences in the form <relation, argument 1, argument 2>.



• Example: "Einstein published the theory of relativity in 1915" gives:

➢ <published, Einstein, theory of relativity>


➢ <published, Einstein, in 1915>

• Flexible and broad, but hard to map results to fixed relation types.
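
A rough sketch of Open-IE-style extraction using spaCy's dependency parse: for each verb, its nominal subject is paired with its object to form a tuple. This captures only simple cases and is far less capable than a full Open IE system; the exact dependency labels depend on the spaCy model used.

# Sketch: simple <subject, verb, object> tuples from a dependency parse
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Einstein published the theory of relativity in 1915.")

for token in doc:
    if token.pos_ == "VERB":
        subjects = [c for c in token.children if c.dep_ == "nsubj"]
        objects  = [c for c in token.children if c.dep_ in ("dobj", "obj")]
        for subj in subjects:
            for obj in objects:
                # use the object's full subtree to keep the noun phrase intact
                obj_phrase = " ".join(t.text for t in obj.subtree)
                print((subj.text, token.lemma_, obj_phrase))
# typically prints ('Einstein', 'publish', 'the theory of relativity')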

7. Explain in detail the taxonomy of chatbots.

1. Chatbots can be classified based on how they interact with users.

2. The three main types are: FAQ bots, Flow-based bots, and Open-ended bots.

1. FAQ (Exact Answer) Bots

3. FAQ bots give fixed, pre-defined answers to common questions.

4. Each user query is treated independently, without context from earlier conversation.

5. These bots are useful for retrieving direct answers, like customer service FAQs.

6. They can handle slightly varied user inputs for the same question.

2. Flow-Based Bots

7. Flow-based bots are more interactive and guided than FAQ bots.

8. They follow a structured conversation flow to achieve a goal.

9. They can remember earlier inputs and maintain context.

10. Common use cases include order placing (e.g., pizza ordering bots).

3. Open-Ended Bots

11. Open-ended bots are designed for free-flowing conversations.

12. They are used mainly for entertainment or companionship.

13. These bots can switch topics easily and are not goal-driven.
Broader Classification

14. Broadly, chatbots are either goal-oriented (e.g., FAQ, flow-based) or chitchat-based (e.g., open-
ended).

15. Goal-oriented bots are domain-specific, while chitchat bots aim for open-domain, natural
conversation.

8. Explain in detail the pipeline for building dialogue systems and the components of a dialog system.

This pipeline shows how a dialog system (like Siri, Alexa, or Google Assistant) processes your voice input
and responds back in natural language.

1. Speech Recognition (ASR - Automatic Speech Recognition)

o Converts spoken words into text.

o Example: You say "What's the weather today?" → It becomes the text “What’s the weather today?”

2. Natural Language Understanding (NLU)

o Understands the meaning and intent behind the text.

o It identifies the user's intent (e.g., asking about the weather) and extracts important information (like
location or date).

3. Dialog Manager (DM)

o Acts like the brain of the system.

o It keeps track of the conversation, decides what to do next, and communicates with the Task Manager.

4. Task Manager

o Performs the actual task requested by the user (e.g., checking weather, setting alarms, fetching facts).

o It gets or processes the required data.

5. Natural Language Generation (NLG)


o Converts the system's response or data (like “27°C and sunny”) into a human-like sentence.

o Example: “The weather today is sunny with a high of 27°C.”

6. Text-to-Speech Synthesis (TTS)

o Converts the generated sentence into spoken words.

o Example: The system says out loud: “The weather today is sunny with a high of 27°C.”

Components of a Dialog System:

1. Speech Recognition (ASR)

• What it does: Converts voice input into written text.

• Analogy: Like typing out what someone is saying.

• Goal: Understand what the user said, in text form.

2. Natural Language Understanding (NLU)

• What it does: Understands the meaning of the user’s text.

• Analogy: Like a friend who not only hears your words but understands what you want.

• Includes:

o Intent recognition (e.g., weather inquiry)

o Entity extraction (e.g., date, city)

3. Dialog Manager (DM)

• What it does: Controls the conversation flow.

• Analogy: Like a smart receptionist who remembers what you said and responds accordingly.

• Tasks:

o Keeps context of the dialog

o Chooses the next action (ask, answer, confirm, etc.)

4. Task Manager

• What it does: Executes the user’s request.

• Analogy: Like a worker who performs the job the receptionist assigns.

• Example: Calls weather API to get temperature data.

5. Natural Language Generation (NLG)

• What it does: Converts raw data into natural sentences.

• Analogy: Like turning bullet points into a smooth paragraph.


• Goal: Make the response sound human.

6. Text-to-Speech Synthesis (TTS)

• What it does: Speaks the response out loud.

• Analogy: Like a robot reading the response in a human voice.

• Goal: Provide a natural voice reply.
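
A toy, rule-based sketch of the NLU → Dialog Manager → Task Manager → NLG flow described above (ASR and TTS are omitted because the input and output here are text). The intent rule, the slot rule, and the weather values are all invented for illustration.

# Toy text-only dialog system: NLU -> Dialog Manager -> Task Manager -> NLG
def nlu(text):
    """Very small NLU: detect intent and extract a crude 'city' slot."""
    intent = "get_weather" if "weather" in text.lower() else "unknown"
    slots = {"city": "Delhi"} if "delhi" in text.lower() else {}
    return intent, slots

def task_manager(intent, slots):
    """Pretend to call a weather API and return raw data."""
    if intent == "get_weather":
        return {"city": slots.get("city", "your city"), "temp_c": 27, "condition": "sunny"}
    return None

def nlg(data):
    """Template-based generation: turn raw data into a sentence."""
    if data is None:
        return "Sorry, I did not understand that."
    return f"The weather in {data['city']} today is {data['condition']} with a high of {data['temp_c']}°C."

def dialog_manager(text):
    """Controls the flow: NLU -> task execution -> NLG."""
    intent, slots = nlu(text)
    return nlg(task_manager(intent, slots))

print(dialog_manager("What's the weather in Delhi today?"))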

9. Into which two categories are chatbots classified? Explain briefly.

1. Chatbots are broadly divided into two main categories:

o Goal-Oriented Dialogues

o Chitchats

1. Goal-Oriented Dialogues

2. These chatbots are designed to help users complete specific tasks (e.g., booking, ordering,
recommending).

3. They follow a structured, task-focused conversation flow.

4. Common types include:

o FAQ bots (fixed responses)

o Flow-based bots (guided conversations)

5. These bots are domain-specific, requiring knowledge of a particular field.

6. Due to this specificity, they may face scalability and generalisability issues.

7. Internally, they use:

o Dialog Act Classification to detect intent

o Slot Filling to extract important info (like location, time, item)

o A Dialog Manager to control conversation flow

8. Modern approaches (e.g., Facebook’s research) try to use end-to-end training methods to improve
goal-oriented chatbots.
2. Chitchats

9. These chatbots are meant for free-form, open-ended conversation, often used for entertainment or
emotional support.

10. They don’t follow a fixed goal or task; instead, they engage on a variety of topics.

11. Future use cases include healthcare (e.g., mental support) and addressing loneliness, especially in the
elderly.

12. A major challenge is generating coherent, accurate, and fact-based responses.

13. Tech giants like Amazon, Apple, and Google are heavily investing in improving chitchat bots, but
lack of natural datasets remains a big hurdle.

10. How does dialog act classification work?

1. Dialogue act classification identifies the intent or purpose behind a user's message in a conversation.

2. It helps determine what the user wants, enabling the chatbot to respond appropriately.

3. Also known as intent classification, it is central to conversational AI.

4. For example, "I want to order pizza" may be classified as orderPizza intent.

5. Utterances like "Are you going to school today?" are classified as yes/no questions.

6. These dialogue acts/intents are pre-defined and based on the chatbot's domain.

7. It is a key part of the Natural Language Understanding (NLU) module.

8. The task is modeled as a classification problem using machine learning (a toy classifier sketch follows this list).

9. Each utterance is assigned to one or more predefined categories.

10. CNNs are used to capture local text patterns (like word n-grams) for intent prediction.

11. Pre-trained models like BERT are highly accurate for this task, achieving over 98% accuracy on
datasets like ATIS.

12. These models use deep contextual understanding to predict the correct dialog act.

13. Labelled training data is essential for building custom systems from scratch.

14. Commercial tools like Google Dialogflow or Microsoft LUIS offer easy-to-use, off-the-shelf intent
classifiers.
15. Accurate intent classification depends on earlier steps, like speech-to-text; errors here can reduce performance.
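
A toy sketch of intent classification treated as text classification (points 8–9 above), using scikit-learn. The tiny training set and intent labels are invented; real systems train on datasets such as ATIS or fine-tune models like BERT.

# Toy intent classifier: TF-IDF features + logistic regression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "I want to order a pizza",
    "Can I get a large pepperoni pizza",
    "What's the weather like today",
    "Will it rain tomorrow",
]
train_intents = ["orderPizza", "orderPizza", "getWeather", "getWeather"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_intents)

# Predict the intent of new utterances
print(clf.predict(["please order me a pizza", "is it going to rain"]))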

11. Briefly explain response generation.

1. Response generation is the final step in a dialogue system, where the system formulates and delivers
a reply.

2. It is based on the intent, slots, and dialogue context passed by the dialogue manager.

3. The goal is to generate appropriate, human-readable responses to continue the conversation smoothly.

Fixed Responses

4. Used in simple FAQ bots, where each intent maps to a predefined response.

5. Responses are retrieved via dictionary lookup or ranking from a pool.

6. Slot values may be ignored or minimally used.

Template-Based Generation

7. Responses are generated using sentence templates filled with slot values.

8. Example: “The House serves cheap Thai food.”

9. Provides grammatical correctness and better control than automatic generation.

10. Suitable for clarifying questions and fact-based replies.
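
A minimal sketch of template-based generation: one template per intent, filled with slot values supplied by the dialog manager. The template strings and slot names are illustrative.

# Template-based NLG: fill predefined sentence templates with slot values
templates = {
    "restaurant_info": "{name} serves {price} {cuisine} food.",
    "weather_report":  "The weather today is {condition} with a high of {temp}°C.",
}

def generate(intent, slots):
    return templates[intent].format(**slots)

print(generate("restaurant_info", {"name": "The House", "price": "cheap", "cuisine": "Thai"}))
# -> The House serves cheap Thai food.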

Automatic Generation

11. Uses deep learning models like seq2seq or reinforcement learning for dynamic, fluent replies.

12. These models generate responses from scratch based on conversation state.

13. While flexible, they may lack factual accuracy and are harder to control.

Challenges and Trade-offs


14. Template methods offer reliability and quality, but automatic generation brings variety and naturalness.

15. A shortage of high-quality conversational datasets and evaluation difficulties remain major
challenges.

12. Explain in detail the end-to-end approach.

1. The end-to-end approach replaces traditional modular chatbot design with a single trainable model.

2. It typically uses sequence-to-sequence (seq2seq) deep learning models.

3. The model takes the entire user input (sequence of words) and directly generates the bot's response
(another word sequence).

4. Unlike modular systems, it does not require separate modules for NLU, dialogue management, and
NLG.

5. It removes the need for an explicitly defined ontology or intent-slot structure.

6. This approach simplifies training by eliminating the need for annotated datasets for individual
components.

7. Transformer models (e.g., GPT, BERT variants) are now widely used over older LSTM-based seq2seq models (a small generation sketch follows this answer).

8. These models effectively capture context and token order, improving natural language
understanding.

9. One limitation is their tendency to produce generic responses like “I don’t know.”

10. To address this, deep reinforcement learning can be used to train the model to give goal-oriented
replies.

11. End-to-end models often have large numbers of parameters, making them computationally
expensive.

12. This makes deployment on low-resource or small-scale devices challenging.

13. These models may also generate factually incorrect or inconsistent responses, affecting real-world
usability.

14. A hybrid approach combining end-to-end generation with human supervision can improve reliability.

15. Despite challenges, the end-to-end method is a powerful tool for building natural, fluent, and open-
domain chatbots.
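
A hedged sketch of end-to-end response generation with the publicly available DialoGPT-small checkpoint via the Hugging Face transformers library (assuming transformers and PyTorch are installed). It shows the idea of mapping the user's word sequence directly to a reply; it is not a production dialogue system.

# End-to-end reply generation with a pretrained dialogue language model
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

user_input = "Hello, how are you today?"
input_ids = tokenizer.encode(user_input + tokenizer.eos_token, return_tensors="pt")

# Generate a continuation, then strip the user's tokens to keep only the bot reply
reply_ids = model.generate(input_ids, max_length=100, pad_token_id=tokenizer.eos_token_id)
reply = tokenizer.decode(reply_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)
print(reply)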

13. What are chatbots? What are their benefits and applications?

1. Chatbots are AI-based systems that interact with users using natural language (text or speech).

2. Their main goal is to understand user input and provide relevant responses.

3. Natural Language Processing (NLP) is central to how chatbots understand and generate language.

4. Chatbots are classified into:

o Goal-oriented bots (e.g., booking, ordering).

o Chitchat bots (e.g., general conversation).

5. Goal-oriented bots help users complete specific tasks like placing orders or booking tickets.

6. Chitchat bots are for open-ended, casual conversations, useful in entertainment or emotional support.

7. A key benefit is enabling hands-free, voice-based interaction, removing the need for screens or
keyboards.

8. Chatbots became more popular due to smartphones and advances in Machine Learning (ML) and
Deep Learning (DL).

9. Tools like Dialogflow help even non-experts to easily create chatbots using cloud APIs.

10. Dialog act classification detects the intent behind the user’s message (e.g., asking a question or
placing an order).

11. Slot identification extracts specific details or entities related to the intent (e.g., size: medium, food:
pizza).

12. The response generation module creates replies using:

o Fixed responses

o Templates

o Automatically generated text

13. Chatbots are used in e-commerce, news discovery, and customer service (e.g., FAQs, order updates).

14. Other applications include healthcare (e.g., symptom checkers like Woebot) and legal services (e.g.,
answering basic legal queries).

15. Hybrid systems (chatbot + human review) are recommended for complex or sensitive tasks, ensuring
both accuracy and reliability.

14. Explain Human-in-the-Loop (HITL).

1. Human-in-the-loop means humans actively intervene in a machine’s learning or decision-making process.

2. The main goal is to improve machine performance by providing human feedback or corrections.

3. Humans act as “teachers” who give rewards or penalties based on the machine’s outputs.

4. This approach is especially useful in reinforcement learning, where the system learns by trial and
error.

5. HITL helps ensure chatbots better fulfill user needs by guiding their learning.

6. It is considered more practical and reliable than fully automated dialogue systems.

7. End-to-end models, although efficient, may fail to generate factually correct or appropriate
responses.

8. Therefore, hybrid systems combine automatic generation with human oversight for greater accuracy
and robustness.

9. Humans step in when the bot’s understanding or action is incorrect, uncertain, or out-of-scope.

10. For example, uncertain classification decisions can be deferred to human evaluators.

11. Facebook uses HITL by having humans provide partial rewards during bot training, improving
response quality.

12. HITL is feasible even with limited computing resources.

13. It promotes responsible AI by supporting fairness, transparency, and accountability, treating AI as


augmented intelligence that assists humans rather than replaces them.

15. Explain how Deep Reinforcement Learning techniques are used for Dialogue generation.

1. DRL tackles a key limitation of traditional sequence-to-sequence (seq2seq) models that often
generate generic or dull responses like "I don't know."

2. Typical seq2seq models lack foresight about how to carry a good conversation toward a meaningful
goal.
3. “Good conversation” means different things depending on the dialogue type and objective.

4. For goal-oriented dialogues, "goodness" means successfully helping the user achieve their specific
goal (e.g., booking a flight).

5. For chitchat/open-ended conversations, "goodness" means keeping the interaction engaging and
interesting.

6. DRL combines goal-driven dialogue management with seq2seq response generation.

7. In DRL, each system response is viewed as an action taken by the agent in the dialogue environment.

8. The system learns to select a sequence of actions that maximize achieving the overall conversation
goal.

9. Learning happens through exploration (trying new responses) and exploitation (using known good
responses).

10. The model receives a "futuristic reward", a feedback signal guiding it toward better long-term
dialogue outcomes.

11. DRL-trained models produce more diverse and goal-focused replies, reducing repetitive or generic
outputs.

12. Human-in-the-loop methods complement DRL by letting humans provide feedback (rewards/penalties) during learning.

13. Human intervention helps when the chatbot misinterprets a query, takes wrong actions, or faces out-
of-scope inputs.

14. Facebook’s implementation showed that injecting partial human rewards significantly improved
chatbot response quality.

15. Combining DRL with human feedback creates hybrid dialogue systems that are more robust, reliable,
and suitable for real-world applications.
