
MODULE 4

TEXT ANALYTICS

Applications and Use Cases of Text Mining


Text mining has a wide range of applications across industries. Below are some key applications and use
cases for extracting meaning from unstructured text and summarizing text:

1. Extracting Meaning from Unstructured Text


This involves identifying patterns, sentiments, and relevant information from large volumes of text data.
a) Sentiment Analysis
●​ Used in social media monitoring, customer feedback, and brand perception analysis.
●​ Helps businesses understand customer emotions toward products or services.
b) Named Entity Recognition (NER)
●​ Extracts names, locations, dates, and organizations from news articles, legal documents, and
financial reports.
●​ Used in fraud detection, legal case research, and journalism.
c) Topic Modeling
●​ Clusters text into meaningful topics using Latent Dirichlet Allocation (LDA) or Non-negative
Matrix Factorization (NMF).
●​ Applied in news categorization, academic research, and content recommendation.
d) Information Retrieval
●​ Helps search engines like Google extract relevant results from unstructured text.
●​ Used in legal case studies, patent search, and medical diagnosis.
e) Opinion Mining
●​ Extracts subjective information from product reviews, blogs, and social media.
●​ Helps companies analyze public sentiment about their brand.
f) Spam Detection & Fake News Detection
●​ Identifies and filters spam emails, fake reviews, and misinformation.
●​ Used in email security, social media moderation, and news authenticity checks.
g) Chatbots & Virtual Assistants
●​ Understands user intent and responds accordingly (e.g., Siri, Alexa, ChatGPT).
● Applied in customer support, automated answering systems, and personalized
recommendations.

2. Summarizing Text
Summarization techniques help reduce large volumes of text while retaining essential information.

a) News Summarization
●​ Automatically summarizes news articles for quick consumption.
●​ Used in news apps, financial market analysis, and political briefings.
b) Academic & Research Paper Summarization
●​ Summarizes long research papers into abstracts or highlights.
●​ Used in scientific literature reviews, legal case summaries, and patent analysis.
c) Meeting Minutes & Transcription Summarization
●​ Converts long meeting transcripts into concise minutes.
●​ Used in corporate meetings, legal hearings, and conference summaries.
d) Legal Document Summarization
●​ Summarizes lengthy contracts, court cases, and regulatory documents.
●​ Helps lawyers, paralegals, and compliance teams quickly extract key insights.
e) Customer Support Ticket Summarization
●​ Condenses customer complaints and inquiries into key issues.
●​ Helps in efficient issue resolution and trend analysis.
f) Summarizing Books & Reports
●​ AI-powered tools like Blinkist summarize books into key takeaways.
●​ Used for executive briefings, student learning, and book reviews.
Text Analysis Steps
A text analysis problem usually consists of three important steps: parsing, search and retrieval, and text
mining. Note that a text analysis problem may also consist of other subtasks (such as discourse and
segmentation) that are outside the scope of this book.
Parsing
●​ It is the process that takes unstructured text and imposes a structure for further analysis. The
unstructured text could be a plain text file, a weblog, an Extensible Markup Language (XML) file, a
HyperText Markup Language (HTML) file, or a Word document.
●​ Parsing deconstructs the provided text and renders it in a more structured way for the
subsequent steps.
Search and retrieval
●​ Search and retrieval is the identification of the documents in a corpus that contain search items
such as specific words, phrases, topics, or entities like people or organizations.
●​ These search items are generally called key terms. Search and retrieval originated from the field
of library science and is now used extensively by web search engines.
Text mining
●​ Text mining uses the terms and indexes produced by the prior two steps to discover meaningful
insights pertaining to domains or problems of interest.
●​ With the proper representation of the text, many of the techniques mentioned in the previous
chapters, such as clustering and classification, can be adapted to text mining.
● For example, the k-means algorithm can be modified to cluster text documents into groups, where each
group represents a collection of documents with a similar topic. The distance of a document to a
centroid represents how closely the document talks about that topic (a minimal clustering sketch follows this list).
●​ Classification tasks such as sentiment analysis and spam filtering are prominent use cases for
the naïve Bayes classifier.
●​ Text mining may utilize methods and techniques from various fields of study, such as statistical
analysis, information retrieval, data mining, and natural language processing.
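
As a rough illustration of the clustering idea mentioned above, the following is a minimal sketch, assuming scikit-learn is available; the four sample reviews are invented for illustration. Documents are represented as TF-IDF vectors and then clustered with k-means.

# Minimal sketch: clustering text documents with k-means (assumes scikit-learn).
# The sample reviews below are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "The battery life of this phone is excellent",
    "Great phone, the battery lasts all day",
    "The ebook reader screen is easy on the eyes",
    "Reading on this ebook reader feels very comfortable",
]

# Represent each document as a TF-IDF vector, then cluster into two topic groups.
vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(vectors)

# Each label is the topic cluster of a document; the distance to a centroid
# indicates how closely the document talks about that topic.
print(kmeans.labels_)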

A Text Analysis Example


❖​ To further describe the three text analysis steps, consider the fictitious company ACME, maker of
two products: bPhone and bEbook.
❖​ ACME is in strong competition with other companies that manufacture and sell similar products.
❖​ To succeed, ACME needs to produce excellent phones and eBook readers and increase sales.
❖​ One of the ways the company does this is to monitor what is being said about ACME products in
social media. In other words, what is the buzz on its products?
❖ ACME wants to search all that is said about ACME products on social media sites, such as Twitter and
Facebook, and popular review sites, such as Amazon and ConsumerReports.
❖​ It wants to answer questions such as these.
●​ Are people mentioning its products?
●​ What is being said?
●​ Are the products seen as good or bad?
●​ If people think an ACME product is bad, why?
●​ For example, are they complaining about the battery life of the bPhone, or the response
time in their bEbook?
❖ ACME can monitor the social media buzz using a simple process based on the three steps
outlined below:
➢ 1. Collect raw text: This corresponds to Phase 1 and Phase 2 of the Data Analytic
Lifecycle. In this step, the Data Science team at ACME monitors websites for references
to specific products. The websites may include social media and review sites. The team
could interact with social network application programming interfaces (APIs), process
data feeds, or scrape pages and use product names as keywords to get the raw data.
Regular expressions are commonly used in this case to identify text that matches certain
patterns (a minimal regex sketch follows this list). Additional filters can be applied to the raw data for a more focused study. For
example, retrieving only the reviews originating in New York instead of the entire United
States would allow ACME to conduct regional studies on its products. Generally, it is
good practice to apply filters during the data collection phase. They can reduce I/O
workloads and minimize the storage requirements.
➢​ 2. Represent text. Convert each review into a suitable document representation with
proper indices, and build a corpus based on these indexed reviews. This step
corresponds to Phases 2 and 3 of the Data Analytic Lifecycle.
➢ 3. Compute the usefulness of each word in the reviews using methods such as TF-IDF.
This and the following two steps correspond to Phases 3 through 5 of the Data Analytic
Lifecycle.
➢​ 4. Categorize documents by topics. This can be achieved through topic models (such as
latent Dirichlet allocation).
➢ 5. Determine sentiments of the reviews. Identify whether the reviews are positive or
negative. Many product review sites provide ratings of a product with each review. If such
information is not available, techniques like sentiment analysis can be used on the
textual data to infer the underlying sentiments. People can express many emotions. To
keep the process simple, ACME considers sentiments as positive, neutral, or negative.
➢ 6. Review the results and gain greater insights (Section 9.8). This step corresponds to
Phases 5 and 6 of the Data Analytic Lifecycle. Marketing gathers the results from the
previous steps. Find out what exactly makes people love or hate a product. Use one or
more visualization techniques to report the findings. Test the soundness of the
conclusions and operationalize the findings if applicable.
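
As referenced in step 1, here is a minimal regex sketch showing how raw posts can be filtered by product keywords. The sample posts are invented; only the product names bPhone and bEbook come from the scenario above.

# Minimal sketch of step 1: keep only raw posts that mention an ACME product.
# The sample posts are invented for illustration.
import re

posts = [
    "Loving my new bPhone, the camera is great!",
    "Traffic was terrible this morning.",
    "The bEbook response time feels sluggish lately.",
]

# Case-insensitive pattern that matches either product name as a whole word.
product_pattern = re.compile(r"\b(bPhone|bEbook)\b", re.IGNORECASE)

relevant = [post for post in posts if product_pattern.search(post)]
print(relevant)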

ACME’s Text Analysis Process


Tokenization
Tokenization refers to breaking down the text into smaller units. It entails splitting paragraphs into
sentences and sentences into words. It is one of the initial steps of any NLP pipeline. Let us have a look
at the two major kinds of tokenization that NLTK provides:

Word Tokenization

It involves breaking down the text into words.

"I study Machine Learning on GeeksforGeeks." will be word-tokenized as

['I', 'study', 'Machine', 'Learning', 'on', 'GeeksforGeeks', '.'].

Sentence Tokenization

It involves breaking down the text into individual sentences.

Example:

"I study Machine Learning on GeeksforGeeks. Currently, I'm studying NLP"

will be sentence-tokenized as

['I study Machine Learning on GeeksforGeeks.', 'Currently, I'm studying NLP.']

In Python, both these tokenizations can be implemented in NLTK as follows:

# Tokenization using NLTK (requires nltk.download('punkt'))
from nltk import word_tokenize, sent_tokenize

sent = ("GeeksforGeeks is a great learning platform. "
        "It is one of the best for Computer Science students.")
print(word_tokenize(sent))
print(sent_tokenize(sent))

Output:

['GeeksforGeeks', 'is', 'a', 'great', 'learning', 'platform', '.',

'It', 'is', 'one', 'of', 'the', 'best', 'for', 'Computer', 'Science', 'students', '.']

['GeeksforGeeks is a great learning platform.',

'It is one of the best for Computer Science students.']

Stemming and Lemmatization


When working with Natural Language, we are not much interested in the form of words – rather, we are
concerned with the meaning that the words intend to convey. Thus, we try to map every word of the
language to its root/base form. This process is called canonicalization.

E.g. The words ‘play’, ‘plays’, ‘played’, and ‘playing’ convey the same action – hence, we can map them all
to their base form i.e. ‘play’.
Now, there are two widely used canonicalization techniques: Stemming and Lemmatization.

Stemming
Stemming generates the base word from the inflected word by removing the affixes of the word. It has a
set of predefined rules that govern the dropping of these affixes. It must be noted that stemmers might
not always result in semantically meaningful base words. Stemmers are faster and computationally less
expensive than lemmatizers.

In the following code, we will be stemming words using Porter Stemmer – one of the most widely used
stemmers:

from nltk.stem import PorterStemmer

# create an object of class PorterStemmer


porter = PorterStemmer()
print(porter.stem("play"))
print(porter.stem("playing"))
print(porter.stem("plays"))
print(porter.stem("played"))

Output:

play

play

play

play

We can see that all the variations of the word ‘play’ have been reduced to the same word, ‘play’. In this
case, the output is a meaningful word. However, this is not always the case. Let us take an example.

from nltk.stem import PorterStemmer


# create an object of class PorterStemmer
porter = PorterStemmer()
print(porter.stem("Communication"))

Output:

commun

The stemmer reduces the word ‘communication’ to a base word ‘commun’ which is meaningless in itself.

Lemmatization
Lemmatization involves grouping together the inflected forms of the same word. This way, we obtain the
base form of a word, which is always meaningful; this base form is called the lemma. Please note that
these groups of inflected forms are stored in the lemmatizer's dictionary; there is no removal of affixes as
in the case of a stemmer.
Lemmatizers are slower and computationally more expensive than stemmers.

Example:

'play', 'plays', 'played', and 'playing' have 'play' as the lemma.

In Python, lemmatization can be implemented in NLTK as follows:

# requires nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

# create an object of class WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# 'v' tells the lemmatizer to treat each word as a verb
print(lemmatizer.lemmatize("plays", 'v'))
print(lemmatizer.lemmatize("played", 'v'))
print(lemmatizer.lemmatize("play", 'v'))
print(lemmatizer.lemmatize("playing", 'v'))

Output:

play

play

play

play

Please note that with lemmatizers, we need to pass the part of speech (POS) of the word along with the
word as a function argument; if it is omitted, NLTK's WordNetLemmatizer treats the word as a noun by default.

Also, lemmatizers always result in meaningful base words. Let us take the same example as we did for
the stemmer.

from nltk.stem import WordNetLemmatizer

# create an object of class WordNetLemmatizer


lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("Communication", 'v'))

Output:

Communication

Part of Speech Tagging


Part of Speech (POS) tagging refers to assigning each word of a sentence to its part of speech. It is
significant as it helps to give a better syntactic overview of a sentence.

Example:

"GeeksforGeeks is a Computer Science platform."

Let's see how NLTK's POS tagger will tag this sentence.

In Python, POS tagging can be implemented in NLTK as follows:

# POS tagging using NLTK (requires nltk.download('averaged_perceptron_tagger'))
from nltk import pos_tag, word_tokenize

text = "GeeksforGeeks is a Computer Science platform."

tokenized_text = word_tokenize(text)
tags = pos_tag(tokenized_text)
print(tags)

Output:

[('GeeksforGeeks', 'NNP'),

('is', 'VBZ'),

('a', 'DT'),

('Computer', 'NNP'),

('Science', 'NNP'),

('platform', 'NN'),

('.', '.')]
TF-IDF
Term frequency-inverse document frequency (TF-IDF) is a natural language processing (NLP) technique
used to measure the importance of different words in a document relative to a collection of documents. It
addresses a weakness of the bag of words technique, which counts words but treats every word as
equally important. TF-IDF is useful in text classification and for helping a machine learning model weigh
words numerically.

● Terminology
● Term frequency (TF)
● Document frequency (DF)
● Inverse document frequency (IDF)
● TF-IDF implementation in Python

Below are the terms you’ll need to understand to create a TF-IDF model.

● t — term (word).
● d — document (a set of words).
● N — number of documents in the corpus.
● corpus — the total document set.

What Is Term Frequency (TF) in TF-IDF?


Term Frequency (TF): Measures how often a word appears in a document. A higher frequency suggests
greater importance. If a term appears frequently in a document, it is likely relevant to the document’s
content.

Formula:

tf(t, d) = (number of times t appears in d) / (total number of terms in d)

Limitations of TF Alone:

●​ TF does not account for the global importance of a term across the entire corpus.
●​ Common words like “the” or “and” may have high TF scores but are not meaningful in
distinguishing documents.

What Is Document Frequency in TF-IDF?


Document frequency (DF) measures how common a term is across the whole corpus. It is similar to TF,
but whereas TF counts how often a term t appears within a single document d, DF counts the number of
documents in the corpus N in which the term t appears. A document counts once if the term occurs in it
at least once; we do not need to know how many times the term is present.

DF Formula:

df(t) = number of documents containing the term t

What Is Inverse Document Frequency (IDF) in TF-IDF?


Inverse Document Frequency (IDF): Reduces the weight of common words across multiple documents
while increasing the weight of rare words. If a term appears in fewer documents, it is more likely to be
meaningful and specific.

Formula:

idf(t) = log(N / df(t))

(A smoothed variant, log(N / (df + 1)), is often used to avoid division by zero when a term appears in no document.)

The logarithm is used to dampen the effect of very large or very small values, ensuring the IDF score
scales appropriately.

It also helps balance the impact of terms that appear in extremely few or extremely many documents.

Limitations of IDF Alone:

●​ IDF does not consider how often a term appears within a specific document.
●​ A term might be rare across the corpus (high IDF) but irrelevant in a specific document (low TF).

How Does TF-IDF Work?


TF-IDF is a measure used to evaluate how important a word is to a document in a collection or corpus.
There are many different variations of TF-IDF, but for now, let’s concentrate on the basic version.

TF-IDF Formula

tf-idf(t, d) = tf(t, d) * log(N/(df + 1))

Converting Text into Vectors with TF-IDF: Example


To better grasp how TF-IDF works, let’s walk through a detailed example. Imagine we have a corpus (a
collection of documents) with three documents:

Document 1: “The cat sat on the mat.”

Document 2: “The dog played in the park.”


Document 3: “Cats and dogs are great pets.”

Our goal is to calculate the TF-IDF score for specific terms in these documents. Let’s focus on the word
“cat” and see how TF-IDF evaluates its importance.

Step 1: Calculate Term Frequency (TF)

For Document 1:

The word “cat” appears 1 time.

The total number of terms in Document 1 is 6 (“the”, “cat”, “sat”, “on”, “the”, “mat”).

So, TF(cat,Document 1) = 1/6

For Document 2:

The word “cat” does not appear.

So, TF(cat,Document 2)=0.

For Document 3:

The word “cat” appears 1 time (as “cats”).

The total number of terms in Document 3 is 6 (“cats”, “and”, “dogs”, “are”, “great”, “pets”).

So, TF(cat,Document 3)=1/6

●​ In Document 1 and Document 3, the word “cat” has the same TF score. This means it appears
with the same relative frequency in both documents.
●​ In Document 2, the TF score is 0 because the word “cat” does not appear.

Step 2: Calculate Inverse Document Frequency (IDF)

Total number of documents in the corpus (D): 3

Number of documents containing the term “cat”: 2 (Document 1 and Document 3).

So,

IDF(cat, D) = log(3/2) ≈ 0.176

The IDF score for “cat” is relatively low. This indicates that the word “cat” is not very rare in the corpus—it
appears in 2 out of 3 documents. If a term appeared in only 1 document, its IDF score would be higher,
indicating greater uniqueness.

Step 3: Calculate TF-IDF

The TF-IDF score for “cat” is TF × IDF = (1/6) × 0.176 ≈ 0.029 in Document 1 and Document 3, and 0 in
Document 2 (since TF is 0 there). The score reflects both the frequency of the term in the document (TF)
and its rarity across the corpus (IDF).
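
The worked example can be reproduced in a few lines of Python. This is only a sketch: it uses the unsmoothed variant idf = log10(N / df), which matches the log(3/2) ≈ 0.176 figure above, rather than the smoothed log(N / (df + 1)) form of the earlier formula.

# Minimal sketch reproducing the worked example above.
# Uses the unsmoothed variant idf = log10(N / df), matching log(3/2) ≈ 0.176.
import math

docs = [
    "the cat sat on the mat",
    "the dog played in the park",
    "cats and dogs are great pets",
]

# Treat "cats" in Document 3 as an occurrence of "cat", as the example does.
docs = [d.replace("cats", "cat") for d in docs]

def tf(term, doc):
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, docs):
    df = sum(1 for d in docs if term in d.split())
    return math.log10(len(docs) / df)

for i, d in enumerate(docs, start=1):
    score = tf("cat", d) * idf("cat", docs)
    print(f"TF-IDF(cat, Document {i}) = {score:.3f}")
# Prints approximately 0.029, 0.000, 0.029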
1. Collecting Raw Text
Raw text is unstructured data gathered from various sources. This is the first step in text mining, where
data is collected for further processing.

Sources of Raw Text:


● Web Scraping: Extracting data from websites (e.g., news articles, product reviews).
● Social Media Feeds: Data from Twitter, Facebook, Reddit, etc.
● Customer Feedback & Reviews: Extracting text from product reviews, surveys, and complaints.
● Emails & Chat Logs: Used in spam detection, fraud analysis, or customer support automation.
● Legal & Government Documents: Extracting information from contracts, case laws, regulations.
● Academic Papers & Reports: Used for research summarization and literature analysis.

Challenges in Collecting Raw Text:
● Data Noise: Unnecessary information, such as advertisements or irrelevant text.
● Different Formats: Data may be in PDFs, scanned images (OCR required), or plain text.
● Data Privacy & Ethics: Ensuring proper data collection policies (e.g., GDPR compliance).

Tools:
📌 Scrapy, BeautifulSoup (Python) for web scraping
📌 Twitter API, Facebook Graph API for social media data
📌 OCR (Tesseract) for scanned documents

2. Representing Text (Text Preprocessing & Vectorization)


Once raw text is collected, it needs to be transformed into a machine-readable format before analysis.

Text Preprocessing Steps:


Tokenization – Splitting text into words or sentences.
Example:
"Text mining is powerful" → ['Text', 'mining', 'is', 'powerful']

Stopword Removal – Removing common words (e.g., "is", "the", "and") that do not add much meaning.
Example:
"Text mining is powerful" → ['Text', 'mining', 'powerful']

Lemmatization / Stemming – Reducing words to their root form.


Example:
"running", "ran", "runs" → "run"

Removing Punctuation & Special Characters – Cleaning unnecessary symbols.

Lowercasing – Converting text to lowercase to ensure uniformity.
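
The steps above can be chained into a single pipeline. Below is a minimal NLTK sketch, assuming the punkt, stopwords, and wordnet resources have been downloaded.

# Minimal sketch chaining the preprocessing steps above with NLTK.
# Assumes nltk.download('punkt'), nltk.download('stopwords') and
# nltk.download('wordnet') have already been run.
import string
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

text = "Text mining is powerful"

tokens = word_tokenize(text.lower())                                  # tokenize + lowercase
tokens = [t for t in tokens if t not in string.punctuation]           # remove punctuation
tokens = [t for t in tokens if t not in stopwords.words("english")]   # remove stopwords
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(t) for t in tokens]                    # reduce to base form

print(tokens)   # ['text', 'mining', 'powerful']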

Text Representation (Vectorization Techniques)


After preprocessing, text must be represented numerically.

1. Bag of Words (BoW)


Converts text into a word frequency matrix.
Example:

"Machine learning is fun. Learning is powerful."

Word        Count
Machine     1
Learning    2
Fun         1
Powerful    1

(The stopword "is" is not counted here.)

2. TF-IDF (Term Frequency-Inverse Document Frequency)


Assigns importance to words based on their frequency across multiple documents.
Rare words get higher weights, while common words get lower scores.
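
Below is a minimal scikit-learn sketch of both representations, using the Bag of Words example sentence from above. The TF-IDF weights are only illustrative here, since the toy corpus has a single document.

# Minimal sketch: Bag of Words and TF-IDF vectors with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["Machine learning is fun. Learning is powerful."]

bow = CountVectorizer(stop_words="english")
counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())   # ['fun' 'learning' 'machine' 'powerful']
print(counts.toarray())              # word counts per document, e.g. [[1 2 1 1]]

tfidf = TfidfVectorizer(stop_words="english")
weights = tfidf.fit_transform(docs)
print(weights.toarray())             # TF-IDF weights per document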

3. Word Embeddings (Word2Vec, GloVe, BERT)


Converts words into dense vector representations that capture meaning and context.
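
A minimal gensim sketch of training word embeddings follows (the gensim 4.x API is assumed; the toy corpus is invented, and real embeddings require far more text).

# Minimal sketch: training a tiny Word2Vec model with gensim (gensim 4.x API).
# The toy corpus is invented; real embeddings need far more text.
from gensim.models import Word2Vec

sentences = [
    ["machine", "learning", "is", "fun"],
    ["deep", "learning", "is", "powerful"],
    ["text", "mining", "uses", "machine", "learning"],
]

model = Word2Vec(sentences=sentences, vector_size=50, window=2, min_count=1, seed=42)

vector = model.wv["learning"]                       # dense 50-dimensional vector
print(vector.shape)                                 # (50,)
print(model.wv.most_similar("learning", topn=2))    # nearest words in this toy space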

Tools:
📌 NLTK, SpaCy for text preprocessing
📌 Scikit-learn (CountVectorizer, TfidfVectorizer) for vectorization
📌 Word2Vec, GloVe, BERT for advanced embeddings

3. Categorizing Documents by Topics (Topic Modeling & Text Classification)


Once text is processed, it can be classified into different topics.

Methods for Topic Categorization


1. Rule-Based Classification
Uses manually defined keywords or patterns to assign topics.
Example: Emails containing "invoice" or "payment" → Finance category.
2. Machine Learning-Based Text Classification
Supervised Learning: Uses labeled training data.
Example: Naïve Bayes, SVM, Random Forest, Deep Learning (LSTMs, Transformers)
Unsupervised Learning: Discovers topics without predefined labels.
Example: K-Means Clustering, Hierarchical Clustering
3. Topic Modeling (Unsupervised)
Identifies hidden themes in large text datasets.
LDA (Latent Dirichlet Allocation): Groups documents into probabilistic topics.
Example: News articles → Topics like "Politics," "Technology," "Sports."

Tools:
📌 Scikit-learn, FastText, TensorFlow/Keras for text classification
📌 Gensim (LDA, LSA, NMF) for topic modeling
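
Below is a minimal gensim LDA sketch on a toy, already-tokenized corpus; the documents and the choice of two topics are invented for illustration.

# Minimal sketch: topic modeling with LDA in gensim (toy corpus for illustration).
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["election", "government", "vote", "policy"],
    ["smartphone", "software", "technology", "launch"],
    ["government", "policy", "election", "debate"],
    ["technology", "startup", "software", "funding"],
]

dictionary = corpora.Dictionary(docs)                 # map each token to an id
corpus = [dictionary.doc2bow(doc) for doc in docs]    # bag-of-words per document

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=42)

# Each topic is a probability distribution over words.
for topic in lda.print_topics(num_words=4):
    print(topic)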

4. Determining Sentiments (Sentiment Analysis)


Sentiment analysis determines whether text is positive, negative, or neutral.
Approaches to Sentiment Analysis
1. Lexicon-Based Approach
Uses predefined sentiment dictionaries (e.g., VADER, SentiWordNet).
Example:
"This movie is amazing!" → Positive (+0.9 sentiment score)
2. Machine Learning Approach
Uses classifiers (e.g., Naïve Bayes, SVM, Logistic Regression) trained on sentiment-labeled data.
Example:
"The food was terrible." → Negative sentiment (-0.8 score)
3. Deep Learning Approach
Uses LSTMs and Transformers (BERT, RoBERTa) for advanced sentiment detection.


Use Cases of Sentiment Analysis


Customer Feedback – Analyzing user reviews (e.g., Amazon, Yelp).
Stock Market Predictions – Using Twitter and news data to predict trends.
Brand Monitoring – Tracking brand reputation on social media.

Tools:
📌 VADER (for short texts), TextBlob (Python) for lexicon-based methods
📌 Scikit-learn, BERT, RoBERTa for ML-based sentiment analysis
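
Below is a minimal lexicon-based sketch using NLTK's VADER (it assumes nltk.download('vader_lexicon') has been run). The compound score is in the spirit of the +0.9 and -0.8 scores quoted above, though the exact values will differ.

# Minimal sketch: lexicon-based sentiment scoring with NLTK's VADER.
# Assumes nltk.download('vader_lexicon') has already been run.
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

for sentence in ["This movie is amazing!", "The food was terrible."]:
    scores = sia.polarity_scores(sentence)
    # 'compound' ranges from -1 (most negative) to +1 (most positive).
    print(sentence, "->", scores["compound"])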

5. Gaining Insights from Text Data


Extracting meaningful insights from text helps in decision-making and trend analysis.

Methods for Gaining Insights


1. Word Frequency Analysis
Identifies the most common words in a dataset.
Example: In customer feedback, "slow delivery" appears often → Key concern area.
2. Named Entity Recognition (NER)
Identifies key entities (e.g., people, locations, dates).
Example:

"Apple announced a new iPhone in California."

Entities extracted:
Apple → Organization
iPhone → Product
California → Location
(A minimal spaCy sketch follows this list.)
3. Text Clustering & Trend Analysis
Groups similar documents based on content similarity.
Used in news aggregation, customer segmentation.
4. Summarization Techniques
Extractive Summarization: Selects key sentences (e.g., TextRank, LexRank).
Abstractive Summarization: Generates new sentences (e.g., Transformers like T5, BART).
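
As referenced in the NER item above, here is a minimal spaCy sketch for the example sentence. It assumes the en_core_web_sm model is installed, and the exact entity labels depend on the model.

# Minimal sketch: named entity recognition with spaCy.
# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple announced a new iPhone in California.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Typical output: Apple -> ORG, California -> GPE (labels are model-dependent)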


Use Cases of Text Insights


Legal & Compliance – Extracting key clauses from contracts.


Healthcare & Medical – Summarizing patient records, medical literature.
Business Intelligence – Analyzing trends in customer complaints.

Tools:
📌 spaCy, NLTK, Gensim (TextRank), HuggingFace Transformers for insights extraction
