
MODULE 4

TEXT ANALYTICS

Applications and Use Cases of Text Mining


Text mining has a wide range of applications across industries. Below are some key applications and use
cases for extracting meaning from unstructured text and summarizing text:

1. Extracting Meaning from Unstructured Text


This involves identifying patterns, sentiments, and relevant information from large volumes of text data.
a) Sentiment Analysis
●​ Used in social media monitoring, customer feedback, and brand perception analysis.
●​ Helps businesses understand customer emotions toward products or services.
b) Named Entity Recognition (NER)
●​ Extracts names, locations, dates, and organizations from news articles, legal documents, and
financial reports.
●​ Used in fraud detection, legal case research, and journalism.
c) Topic Modeling
●​ Clusters text into meaningful topics using Latent Dirichlet Allocation (LDA) or Non-negative
Matrix Factorization (NMF).
●​ Applied in news categorization, academic research, and content recommendation.
d) Information Retrieval
●​ Helps search engines like Google extract relevant results from unstructured text.
●​ Used in legal case studies, patent search, and medical diagnosis.
e) Opinion Mining
●​ Extracts subjective information from product reviews, blogs, and social media.
●​ Helps companies analyze public sentiment about their brand.
f) Spam Detection & Fake News Detection
●​ Identifies and filters spam emails, fake reviews, and misinformation.
●​ Used in email security, social media moderation, and news authenticity checks.
g) Chatbots & Virtual Assistants
●​ Understands user intent and responds accordingly (e.g., Siri, Alexa, ChatGPT).
● Applied in customer support, automated answering systems, and personalized
recommendations.

2. Summarizing Text
Summarization techniques help reduce large volumes of text while retaining essential information.

a) News Summarization
●​ Automatically summarizes news articles for quick consumption.
●​ Used in news apps, financial market analysis, and political briefings.
b) Academic & Research Paper Summarization
●​ Summarizes long research papers into abstracts or highlights.
●​ Used in scientific literature reviews, legal case summaries, and patent analysis.
c) Meeting Minutes & Transcription Summarization
●​ Converts long meeting transcripts into concise minutes.
●​ Used in corporate meetings, legal hearings, and conference summaries.
d) Legal Document Summarization
●​ Summarizes lengthy contracts, court cases, and regulatory documents.
●​ Helps lawyers, paralegals, and compliance teams quickly extract key insights.
e) Customer Support Ticket Summarization
●​ Condenses customer complaints and inquiries into key issues.
●​ Helps in efficient issue resolution and trend analysis.
f) Summarizing Books & Reports
●​ AI-powered tools like Blinkist summarize books into key takeaways.
●​ Used for executive briefings, student learning, and book reviews.
Text Analysis Steps
A text analysis problem usually consists of three important steps: parsing, search and retrieval, and text
mining. Note that a text analysis problem may also consist of other subtasks (such as discourse and
segmentation) that are outside the scope of this book.
Parsing
●​ It is the process that takes unstructured text and imposes a structure for further analysis. The
unstructured text could be a plain text file, a weblog, an Extensible Markup Language (XML) file, a
HyperText Markup Language (HTML) file, or a Word document.
●​ Parsing deconstructs the provided text and renders it in a more structured way for the
subsequent steps.
Search and retrieval
●​ Search and retrieval is the identification of the documents in a corpus that contain search items
such as specific words, phrases, topics, or entities like people or organizations.
●​ These search items are generally called key terms. Search and retrieval originated from the field
of library science and is now used extensively by web search engines.
Text mining
●​ Text mining uses the terms and indexes produced by the prior two steps to discover meaningful
insights pertaining to domains or problems of interest.
●​ With the proper representation of the text, many of the techniques mentioned in the previous
chapters, such as clustering and classification, can be adapted to text mining.
● For example, the k-means algorithm can be modified to cluster text documents into groups, where each
group represents a collection of documents with a similar topic. The distance of a document to a
centroid represents how closely the document talks about that topic (a minimal clustering sketch follows this list).
●​ Classification tasks such as sentiment analysis and spam filtering are prominent use cases for
the naïve Bayes classifier.
●​ Text mining may utilize methods and techniques from various fields of study, such as statistical
analysis, information retrieval, data mining, and natural language processing.
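
As a rough illustration of the clustering idea mentioned above, the following is a minimal sketch, assuming scikit-learn is available; the four sample reviews are invented for illustration. Documents are represented as TF-IDF vectors and then clustered with k-means.

# Minimal sketch: clustering text documents with k-means (assumes scikit-learn).
# The sample reviews below are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "The battery life of this phone is excellent",
    "Great phone, the battery lasts all day",
    "The ebook reader screen is easy on the eyes",
    "Reading on this ebook reader feels very comfortable",
]

# Represent each document as a TF-IDF vector, then cluster into two topic groups.
vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(vectors)

# Each label is the topic cluster of a document; the distance to a centroid
# indicates how closely the document talks about that topic.
print(kmeans.labels_)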

A Text Analysis Example


❖​ To further describe the three text analysis steps, consider the fictitious company ACME, maker of
two products: bPhone and bEbook.
❖​ ACME is in strong competition with other companies that manufacture and sell similar products.
❖​ To succeed, ACME needs to produce excellent phones and eBook readers and increase sales.
❖​ One of the ways the company does this is to monitor what is being said about ACME products in
social media. In other words, what is the buzz on its products?
❖ ACME wants to search all that is said about ACME products on social media sites, such as Twitter and
Facebook, and popular review sites, such as Amazon and ConsumerReports.
❖​ It wants to answer questions such as these.
●​ Are people mentioning its products?
●​ What is being said?
●​ Are the products seen as good or bad?
●​ If people think an ACME product is bad, why?
●​ For example, are they complaining about the battery life of the bPhone, or the response
time in their bEbook?
❖ ACME can monitor the social media buzz using a simple process based on the three steps
outlined below:
➢ 1. Collect raw text: This corresponds to Phase 1 and Phase 2 of the Data Analytic
Lifecycle. In this step, the Data Science team at ACME monitors websites for references
to specific products. The websites may include social media and review sites. The team
could interact with social network application programming interfaces (APIs), process
data feeds, or scrape pages and use product names as keywords to get the raw data.
Regular expressions are commonly used in this case to identify text that matches certain
patterns (a minimal regex sketch follows this list). Additional filters can be applied to the raw data for a more focused study. For
example, retrieving only the reviews originating in New York instead of the entire United
States would allow ACME to conduct regional studies on its products. Generally, it is
good practice to apply filters during the data collection phase. They can reduce I/O
workloads and minimize the storage requirements.
➢​ 2. Represent text. Convert each review into a suitable document representation with
proper indices, and build a corpus based on these indexed reviews. This step
corresponds to Phases 2 and 3 of the Data Analytic Lifecycle.
➢ 3. Compute the usefulness of each word in the reviews using methods such as TF-IDF.
This and the following two steps correspond to Phases 3 through 5 of the Data Analytic
Lifecycle.
➢​ 4. Categorize documents by topics. This can be achieved through topic models (such as
latent Dirichlet allocation).
➢ 5. Determine sentiments of the reviews. Identify whether the reviews are positive or
negative. Many product review sites provide ratings of a product with each review. If such
information is not available, techniques like sentiment analysis can be used on the
textual data to infer the underlying sentiments. People can express many emotions. To
keep the process simple, ACME considers sentiments as positive, neutral, or negative.
➢ 6. Review the results and gain greater insights (Section 9.8). This step corresponds to
Phases 5 and 6 of the Data Analytic Lifecycle. Marketing gathers the results from the
previous steps. Find out what exactly makes people love or hate a product. Use one or
more visualization techniques to report the findings. Test the soundness of the
conclusions and operationalize the findings if applicable.
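
As referenced in step 1, here is a minimal regex sketch showing how raw posts can be filtered by product keywords. The sample posts are invented; only the product names bPhone and bEbook come from the scenario above.

# Minimal sketch of step 1: keep only raw posts that mention an ACME product.
# The sample posts are invented for illustration.
import re

posts = [
    "Loving my new bPhone, the camera is great!",
    "Traffic was terrible this morning.",
    "The bEbook response time feels sluggish lately.",
]

# Case-insensitive pattern that matches either product name as a whole word.
product_pattern = re.compile(r"\b(bPhone|bEbook)\b", re.IGNORECASE)

relevant = [post for post in posts if product_pattern.search(post)]
print(relevant)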

ACME’s Text Analysis Process


Tokenization
Tokenization refers to breaking down the text into smaller units. It entails splitting paragraphs into
sentences and sentences into words. It is one of the initial steps of any NLP pipeline. Let us have a look
at the two major kinds of tokenization that NLTK provides:

Word Tokenization

It involves breaking down the text into words.

"I study Machine Learning on GeeksforGeeks." will be word-tokenized as

['I', 'study', 'Machine', 'Learning', 'on', 'GeeksforGeeks', '.'].

Sentence Tokenization

It involves breaking down the text into individual sentences.

Example:

"I study Machine Learning on GeeksforGeeks. Currently, I'm studying NLP"

will be sentence-tokenized as

['I study Machine Learning on GeeksforGeeks.', 'Currently, I'm studying NLP.']

In Python, both these tokenizations can be implemented in NLTK as follows:

# Tokenization using NLTK (requires nltk.download('punkt'))
from nltk import word_tokenize, sent_tokenize

sent = ("GeeksforGeeks is a great learning platform. "
        "It is one of the best for Computer Science students.")
print(word_tokenize(sent))
print(sent_tokenize(sent))

Output:

['GeeksforGeeks', 'is', 'a', 'great', 'learning', 'platform', '.',

'It', 'is', 'one', 'of', 'the', 'best', 'for', 'Computer', 'Science', 'students', '.']

['GeeksforGeeks is a great learning platform.',

'It is one of the best for Computer Science students.']

Stemming and Lemmatization


When working with Natural Language, we are not much interested in the form of words – rather, we are
concerned with the meaning that the words intend to convey. Thus, we try to map every word of the
language to its root/base form. This process is called canonicalization.

E.g. The words ‘play’, ‘plays’, ‘played’, and ‘playing’ convey the same action – hence, we can map them all
to their base form i.e. ‘play’.
Now, there are two widely used canonicalization techniques: Stemming and Lemmatization.

Stemming
Stemming generates the base word from the inflected word by removing the affixes of the word. It has a
set of predefined rules that govern the dropping of these affixes. It must be noted that stemmers might
not always result in semantically meaningful base words. Stemmers are faster and computationally less
expensive than lemmatizers.

In the following code, we will be stemming words using Porter Stemmer – one of the most widely used
stemmers:

from nltk.stem import PorterStemmer

# create an object of class PorterStemmer


porter = PorterStemmer()
print(porter.stem("play"))
print(porter.stem("playing"))
print(porter.stem("plays"))
print(porter.stem("played"))

Output:

play

play

play

play

We can see that all the variations of the word ‘play’ have been reduced to the same word, ‘play’. In this
case, the output is a meaningful word. However, this is not always the case. Let us take an example.

from nltk.stem import PorterStemmer


# create an object of class PorterStemmer
porter = PorterStemmer()
print(porter.stem("Communication"))

Output:

commun

The stemmer reduces the word ‘communication’ to a base word ‘commun’ which is meaningless in itself.

Lemmatization
Lemmatization involves grouping together the inflected forms of the same word. This way, we obtain the
base form of a word, which is always meaningful; this base form is called the lemma. Please note that
these groups of inflected forms are stored in the lemmatizer's dictionary; there is no removal of affixes as
in the case of a stemmer.
Lemmatizers are slower and computationally more expensive than stemmers.

Example:

'play', 'plays', 'played', and 'playing' have 'play' as the lemma.

In Python, lemmatization can be implemented in NLTK as follows:

# requires nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

# create an object of class WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# 'v' tells the lemmatizer to treat each word as a verb
print(lemmatizer.lemmatize("plays", 'v'))
print(lemmatizer.lemmatize("played", 'v'))
print(lemmatizer.lemmatize("play", 'v'))
print(lemmatizer.lemmatize("playing", 'v'))

Output:

play

play

play

play

Please note that with lemmatizers, we need to pass the part of speech (POS) of the word along with the
word as a function argument; if it is omitted, NLTK's WordNetLemmatizer treats the word as a noun by default.

Also, lemmatizers always result in meaningful base words. Let us take the same example as we did for
the stemmer.

from nltk.stem import WordNetLemmatizer

# create an object of class WordNetLemmatizer


lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("Communication", 'v'))

Output:

Communication

Part of Speech Tagging


Part of Speech (POS) tagging refers to assigning each word of a sentence to its part of speech. It is
significant as it helps to give a better syntactic overview of a sentence.

Example:

"GeeksforGeeks is a Computer Science platform."

Let's see how NLTK's POS tagger will tag this sentence.

In Python, POS tagging can be implemented in NLTK as follows:

# POS tagging using NLTK (requires nltk.download('averaged_perceptron_tagger'))
from nltk import pos_tag, word_tokenize

text = "GeeksforGeeks is a Computer Science platform."

tokenized_text = word_tokenize(text)
tags = pos_tag(tokenized_text)
print(tags)

Output:

[('GeeksforGeeks', 'NNP'),

('is', 'VBZ'),

('a', 'DT'),

('Computer', 'NNP'),

('Science', 'NNP'),

('platform', 'NN'),

('.', '.')]
TF-IDF
Term frequency-inverse document frequency (TF-IDF) is a natural language processing (NLP) technique
used to measure the importance of different words in a document relative to a collection of documents. It
addresses a weakness of the bag of words technique, which counts words but treats every word as
equally important. TF-IDF is useful in text classification and for helping a machine learning model weigh
words numerically.

● Terminology
● Term frequency (TF)
● Document frequency (DF)
● Inverse document frequency (IDF)
● TF-IDF implementation in Python

Below are the terms you’ll need to understand to create a TF-IDF model.

● t — term (word).
● d — document (a set of words).
● N — number of documents in the corpus.
● corpus — the total document set.

What Is Term Frequency (TF) in TF-IDF?


Term Frequency (TF): Measures how often a word appears in a document. A higher frequency suggests
greater importance. If a term appears frequently in a document, it is likely relevant to the document’s
content.

Formula:

tf(t, d) = (number of times t appears in d) / (total number of terms in d)

Limitations of TF Alone:

●​ TF does not account for the global importance of a term across the entire corpus.
●​ Common words like “the” or “and” may have high TF scores but are not meaningful in
distinguishing documents.

What Is Document Frequency in TF-IDF?


Document frequency (DF) measures how common a term is across the whole corpus. It is similar to TF,
but whereas TF counts how often a term t appears within a single document d, DF counts the number of
documents in the corpus N in which the term t appears. A document counts once if the term occurs in it
at least once; we do not need to know how many times the term is present.

DF Formula:

df(t) = number of documents containing the term t

What Is Inverse Document Frequency (IDF) in TF-IDF?


Inverse Document Frequency (IDF): Reduces the weight of common words across multiple documents
while increasing the weight of rare words. If a term appears in fewer documents, it is more likely to be
meaningful and specific.

Formula:

idf(t) = log(N / df(t))

(A smoothed variant, log(N / (df + 1)), is often used to avoid division by zero when a term appears in no document.)

The logarithm is used to dampen the effect of very large or very small values, ensuring the IDF score
scales appropriately.

It also helps balance the impact of terms that appear in extremely few or extremely many documents.

Limitations of IDF Alone:

●​ IDF does not consider how often a term appears within a specific document.
●​ A term might be rare across the corpus (high IDF) but irrelevant in a specific document (low TF).

How Does TF-IDF Work?


TF-IDF is a measure used to evaluate how important a word is to a document in a collection or corpus.
There are many different variations of TF-IDF, but for now, let’s concentrate on the basic version.

TF-IDF Formula

tf-idf(t, d) = tf(t, d) * log(N/(df + 1))

Converting Text into Vectors with TF-IDF: Example


To better grasp how TF-IDF works, let’s walk through a detailed example. Imagine we have a corpus (a
collection of documents) with three documents:

Document 1: “The cat sat on the mat.”

Document 2: “The dog played in the park.”


Document 3: “Cats and dogs are great pets.”

Our goal is to calculate the TF-IDF score for specific terms in these documents. Let’s focus on the word
“cat” and see how TF-IDF evaluates its importance.

Step 1: Calculate Term Frequency (TF)

For Document 1:

The word “cat” appears 1 time.

The total number of terms in Document 1 is 6 (“the”, “cat”, “sat”, “on”, “the”, “mat”).

So, TF(cat,Document 1) = 1/6

For Document 2:

The word “cat” does not appear.

So, TF(cat,Document 2)=0.

For Document 3:

The word “cat” appears 1 time (as “cats”).

The total number of terms in Document 3 is 6 (“cats”, “and”, “dogs”, “are”, “great”, “pets”).

So, TF(cat,Document 3)=1/6

●​ In Document 1 and Document 3, the word “cat” has the same TF score. This means it appears
with the same relative frequency in both documents.
●​ In Document 2, the TF score is 0 because the word “cat” does not appear.

Step 2: Calculate Inverse Document Frequency (IDF)

Total number of documents in the corpus (D): 3

Number of documents containing the term “cat”: 2 (Document 1 and Document 3).

So,

IDF(cat, D) = log(3/2) ≈ 0.176

The IDF score for “cat” is relatively low. This indicates that the word “cat” is not very rare in the corpus—it
appears in 2 out of 3 documents. If a term appeared in only 1 document, its IDF score would be higher,
indicating greater uniqueness.

Step 3: Calculate TF-IDF

The TF-IDF score for “cat” is TF × IDF = (1/6) × 0.176 ≈ 0.029 in Document 1 and Document 3, and 0 in
Document 2 (since TF is 0 there). The score reflects both the frequency of the term in the document (TF)
and its rarity across the corpus (IDF).
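
The worked example can be reproduced in a few lines of Python. This is only a sketch: it uses the unsmoothed variant idf = log10(N / df), which matches the log(3/2) ≈ 0.176 figure above, rather than the smoothed log(N / (df + 1)) form of the earlier formula.

# Minimal sketch reproducing the worked example above.
# Uses the unsmoothed variant idf = log10(N / df), matching log(3/2) ≈ 0.176.
import math

docs = [
    "the cat sat on the mat",
    "the dog played in the park",
    "cats and dogs are great pets",
]

# Treat "cats" in Document 3 as an occurrence of "cat", as the example does.
docs = [d.replace("cats", "cat") for d in docs]

def tf(term, doc):
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, docs):
    df = sum(1 for d in docs if term in d.split())
    return math.log10(len(docs) / df)

for i, d in enumerate(docs, start=1):
    score = tf("cat", d) * idf("cat", docs)
    print(f"TF-IDF(cat, Document {i}) = {score:.3f}")
# Prints approximately 0.029, 0.000, 0.029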
1. Collecting Raw Text
Raw text is unstructured data gathered from various sources. This is the first step in text mining, where
data is collected for further processing.

Sources of Raw Text:


● Web Scraping: Extracting data from websites (e.g., news articles, product reviews).
● Social Media Feeds: Data from Twitter, Facebook, Reddit, etc.
● Customer Feedback & Reviews: Extracting text from product reviews, surveys, and complaints.
● Emails & Chat Logs: Used in spam detection, fraud analysis, or customer support automation.
● Legal & Government Documents: Extracting information from contracts, case laws, regulations.
● Academic Papers & Reports: Used for research summarization and literature analysis.

Challenges in Collecting Raw Text:
● Data Noise: Unnecessary information, such as advertisements or irrelevant text.
● Different Formats: Data may be in PDFs, scanned images (OCR required), or plain text.
● Data Privacy & Ethics: Ensuring proper data collection policies (e.g., GDPR compliance).

Tools:
📌 Scrapy, BeautifulSoup (Python) for web scraping
📌 Twitter API, Facebook Graph API for social media data
📌 OCR (Tesseract) for scanned documents

2. Representing Text (Text Preprocessing & Vectorization)


Once raw text is collected, it needs to be transformed into a machine-readable format before analysis.

Text Preprocessing Steps:


Tokenization – Splitting text into words or sentences.
Example:
"Text mining is powerful" → ['Text', 'mining', 'is', 'powerful']

Stopword Removal – Removing common words (e.g., "is", "the", "and") that do not add much meaning.
Example:
"Text mining is powerful" → ['Text', 'mining', 'powerful']

Lemmatization / Stemming – Reducing words to their root form.


Example:
"running", "ran", "runs" → "run"

Removing Punctuation & Special Characters – Cleaning unnecessary symbols.

Lowercasing – Converting text to lowercase to ensure uniformity.
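
The steps above can be chained into a single pipeline. Below is a minimal NLTK sketch, assuming the punkt, stopwords, and wordnet resources have been downloaded.

# Minimal sketch chaining the preprocessing steps above with NLTK.
# Assumes nltk.download('punkt'), nltk.download('stopwords') and
# nltk.download('wordnet') have already been run.
import string
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

text = "Text mining is powerful"

tokens = word_tokenize(text.lower())                                  # tokenize + lowercase
tokens = [t for t in tokens if t not in string.punctuation]           # remove punctuation
tokens = [t for t in tokens if t not in stopwords.words("english")]   # remove stopwords
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(t) for t in tokens]                    # reduce to base form

print(tokens)   # ['text', 'mining', 'powerful']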

Text Representation (Vectorization Techniques)


After preprocessing, text must be represented numerically.

1. Bag of Words (BoW)


Converts text into a word frequency matrix.
Example:

"Machine learning is fun. Learning is powerful."

Word        Count
Machine     1
Learning    2
Fun         1
Powerful    1

(The stopword "is" is not counted here.)

2. TF-IDF (Term Frequency-Inverse Document Frequency)


Assigns importance to words based on their frequency across multiple documents.
Rare words get higher weights, while common words get lower scores.
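
Below is a minimal scikit-learn sketch of both representations, using the Bag of Words example sentence from above. The TF-IDF weights are only illustrative here, since the toy corpus has a single document.

# Minimal sketch: Bag of Words and TF-IDF vectors with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["Machine learning is fun. Learning is powerful."]

bow = CountVectorizer(stop_words="english")
counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())   # ['fun' 'learning' 'machine' 'powerful']
print(counts.toarray())              # word counts per document, e.g. [[1 2 1 1]]

tfidf = TfidfVectorizer(stop_words="english")
weights = tfidf.fit_transform(docs)
print(weights.toarray())             # TF-IDF weights per document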

3. Word Embeddings (Word2Vec, GloVe, BERT)


Converts words into dense vector representations that capture meaning and context.
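
A minimal gensim sketch of training word embeddings follows (the gensim 4.x API is assumed; the toy corpus is invented, and real embeddings require far more text).

# Minimal sketch: training a tiny Word2Vec model with gensim (gensim 4.x API).
# The toy corpus is invented; real embeddings need far more text.
from gensim.models import Word2Vec

sentences = [
    ["machine", "learning", "is", "fun"],
    ["deep", "learning", "is", "powerful"],
    ["text", "mining", "uses", "machine", "learning"],
]

model = Word2Vec(sentences=sentences, vector_size=50, window=2, min_count=1, seed=42)

vector = model.wv["learning"]                       # dense 50-dimensional vector
print(vector.shape)                                 # (50,)
print(model.wv.most_similar("learning", topn=2))    # nearest words in this toy space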

Tools:
📌 NLTK, SpaCy for text preprocessing
📌 Scikit-learn (CountVectorizer, TfidfVectorizer) for vectorization
📌 Word2Vec, GloVe, BERT for advanced embeddings

3. Categorizing Documents by Topics (Topic Modeling & Text Classification)


Once text is processed, it can be classified into different topics.

Methods for Topic Categorization


1. Rule-Based Classification
Uses manually defined keywords or patterns to assign topics.
Example: Emails containing "invoice" or "payment" → Finance category.
2. Machine Learning-Based Text Classification
Supervised Learning: Uses labeled training data.
Example: Naïve Bayes, SVM, Random Forest, Deep Learning (LSTMs, Transformers)
Unsupervised Learning: Discovers topics without predefined labels.
Example: K-Means Clustering, Hierarchical Clustering
3. Topic Modeling (Unsupervised)
Identifies hidden themes in large text datasets.
LDA (Latent Dirichlet Allocation): Groups documents into probabilistic topics.
Example: News articles → Topics like "Politics," "Technology," "Sports."

Tools:
📌 Scikit-learn, FastText, TensorFlow/Keras for text classification
📌 Gensim (LDA, LSA, NMF) for topic modeling
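
Below is a minimal gensim LDA sketch on a toy, already-tokenized corpus; the documents and the choice of two topics are invented for illustration.

# Minimal sketch: topic modeling with LDA in gensim (toy corpus for illustration).
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["election", "government", "vote", "policy"],
    ["smartphone", "software", "technology", "launch"],
    ["government", "policy", "election", "debate"],
    ["technology", "startup", "software", "funding"],
]

dictionary = corpora.Dictionary(docs)                 # map each token to an id
corpus = [dictionary.doc2bow(doc) for doc in docs]    # bag-of-words per document

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=42)

# Each topic is a probability distribution over words.
for topic in lda.print_topics(num_words=4):
    print(topic)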

4. Determining Sentiments (Sentiment Analysis)


Sentiment analysis determines whether text is positive, negative, or neutral.
Approaches to Sentiment Analysis
1. Lexicon-Based Approach
Uses predefined sentiment dictionaries (e.g., VADER, SentiWordNet).
Example:
"This movie is amazing!" → Positive (+0.9 sentiment score)
2. Machine Learning Approach
Uses classifiers (e.g., Naïve Bayes, SVM, Logistic Regression) trained on sentiment-labeled data.
Example:
"The food was terrible." → Negative sentiment (-0.8 score)
3. Deep Learning Approach
Uses LSTMs and Transformers (BERT, RoBERTa) for advanced sentiment detection.


Use Cases of Sentiment Analysis


Customer Feedback – Analyzing user reviews (e.g., Amazon, Yelp).
Stock Market Predictions – Using Twitter and news data to predict trends.
Brand Monitoring – Tracking brand reputation on social media.

Tools:
📌 VADER (for short texts), TextBlob (Python) for lexicon-based methods
📌 Scikit-learn, BERT, RoBERTa for ML-based sentiment analysis
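
Below is a minimal lexicon-based sketch using NLTK's VADER (it assumes nltk.download('vader_lexicon') has been run). The compound score is in the spirit of the +0.9 and -0.8 scores quoted above, though the exact values will differ.

# Minimal sketch: lexicon-based sentiment scoring with NLTK's VADER.
# Assumes nltk.download('vader_lexicon') has already been run.
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

for sentence in ["This movie is amazing!", "The food was terrible."]:
    scores = sia.polarity_scores(sentence)
    # 'compound' ranges from -1 (most negative) to +1 (most positive).
    print(sentence, "->", scores["compound"])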

5. Gaining Insights from Text Data


Extracting meaningful insights from text helps in decision-making and trend analysis.

Methods for Gaining Insights


1. Word Frequency Analysis
Identifies the most common words in a dataset.
Example: In customer feedback, "slow delivery" appears often → Key concern area.
2. Named Entity Recognition (NER)
Identifies key entities (e.g., people, locations, dates).
Example:

"Apple announced a new iPhone in California."

Entities extracted:
Apple → Organization
iPhone → Product
California → Location
(A minimal spaCy sketch follows this list.)
3. Text Clustering & Trend Analysis
Groups similar documents based on content similarity.
Used in news aggregation, customer segmentation.
4. Summarization Techniques
Extractive Summarization: Selects key sentences (e.g., TextRank, LexRank).
Abstractive Summarization: Generates new sentences (e.g., Transformers like T5, BART).
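
As referenced in the NER item above, here is a minimal spaCy sketch for the example sentence. It assumes the en_core_web_sm model is installed, and the exact entity labels depend on the model.

# Minimal sketch: named entity recognition with spaCy.
# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple announced a new iPhone in California.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Typical output: Apple -> ORG, California -> GPE (labels are model-dependent)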


Use Cases of Text Insights


Legal & Compliance – Extracting key clauses from contracts.


Healthcare & Medical – Summarizing patient records, medical literature.
Business Intelligence – Analyzing trends in customer complaints.

Tools:
📌 spaCy, NLTK, Gensim (TextRank), HuggingFace Transformers for insights extraction
