Lecture 1: Introduction to Text Analytics
Pilsung Kang
School of Industrial Management Engineering
Korea University
AGENDA
01 Text Analytics: Overview
02 TA Process 1: Collection & Preprocessing
03 TA Process 2: Transformation
04 TA Process 3: Dimensionality Reduction
05 TA Process 4: Learning & Evaluation
Decide What to Mine Witte (2006)
• From a wide range of text sources…
Decide What to Mine
• Open datasets/repositories
✓ The best 25 datasets for NLP (2018.06.07)
▪ https://gengo.ai/datasets/the-best-25-datasets-for-natural-language-processing/
✓ Alphabetical list of free/public domain datasets with text data for use in Natural
Language Processing (NLP)
▪ https://github.com/niderhoff/nlp-datasets
✓ 50 Free Machine Learning Datasets: Natural Language Processing
▪ https://blog.cambridgespark.com/50-free-machine-learning-datasets-natural-language-processing-d88fb9c5c8da
✓ 25 Open Datasets for Deep Learning Every Data Scientist Must Work With
▪ https://www.analyticsvidhya.com/blog/2018/03/comprehensive-collection-deep-learning-datasets/
Text Preprocessing Level 0: Text
• Remove unnecessary information from the collected data
• Remove figures, advertisements, HTML syntax, hyperlinks, etc.
Text Preprocessing Level 0: Text
• Do not remove meta-data, which contains significant information about the text
✓ Ex) Newspaper article: author, date, category, language, newspaper, etc.
• Meta-data can be used for further analysis
✓ Target class of a document
✓ Time series analysis
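A minimal Level-0 cleaning sketch, assuming the collected documents are raw HTML; it uses only the Python standard library, and the article fields and regular expressions are illustrative rather than a prescribed recipe. Note that the meta-data fields are kept alongside the cleaned body, as recommended above.

```python
import re
from html import unescape

def clean_html(raw: str) -> str:
    """Level-0 cleaning sketch: strip tags, scripts, and hyperlinks, keep the text."""
    text = re.sub(r"(?is)<(script|style).*?>.*?</\1>", " ", raw)  # drop scripts/styles
    text = re.sub(r"(?s)<[^>]+>", " ", text)                      # drop remaining HTML tags
    text = re.sub(r"https?://\S+", " ", text)                     # drop bare hyperlinks
    text = unescape(text)                                         # decode HTML entities
    return re.sub(r"\s+", " ", text).strip()                      # normalize whitespace

# Keep meta-data (author, date, category, ...) alongside the cleaned body
article = {"author": "J. Doe", "date": "2018-06-07", "category": "politics",
           "body": clean_html("<p>See <a href='http://x.y'>this</a> story.</p>")}
print(article["body"])   # -> "See this story."
```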
Text Preprocessing Level 1: Sentence
• Correct sentence boundary detection is also important
✓ For many downstream analysis tasks
▪ POS taggers maximize the probabilities of tags within a sentence
▪ Summarization systems rely on correct detection of sentence boundaries
Text Preprocessing Level 1: Sentence
• Sentence Splitting
Text Preprocessing Level 1: Sentence
• Sentence Splitting with Rule-based Model
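A toy rule-based splitter along these lines is sketched below; the regular expression and the abbreviation list are assumptions for illustration, and production splitters (e.g., NLTK's Punkt models) use much richer rules and statistics.

```python
import re

# Common abbreviations that should not end a sentence (illustrative, not exhaustive)
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc.", "a.m.", "p.m."}

def split_sentences(text: str) -> list[str]:
    """Rule: split on ., !, ? followed by whitespace and an uppercase letter,
    unless the token in front of the punctuation is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        last_token = text[start:match.end()].split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue                                  # abbreviation: not a boundary
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    sentences.append(text[start:].strip())
    return [s for s in sentences if s]

print(split_sentences("Dr. Smith gave a talk. It went well! Mr. Lee asked questions."))
# -> ['Dr. Smith gave a talk.', 'It went well!', 'Mr. Lee asked questions.']
```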
Text Preprocessing Level 2: Token
• Extracting meaningful tokens (words, numbers, spaces, etc.) worth analyzing from a text is not an easy task.
✓ Is “John’s sick” one token or two?
▪ If one → problems in parsing (where is the verb?)
▪ If two → what do we do with “John’s house”?
✓ What to do with hyphens?
▪ database vs. data-base vs. data base
✓ What to do with “C++”, “A/C”, “:-)”, “…”, “ㅋㅋㅋㅋㅋㅋㅋㅋ”?
✓ Some languages do not use whitespace (e.g., Chinese)
• Consistent tokenization is important for all later processing steps.
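The sketch below shows one common convention, NLTK's Penn Treebank-style word_tokenize (it requires the 'punkt' tokenizer models); other tokenizers resolve the cases above differently.

```python
from nltk.tokenize import word_tokenize   # pip install nltk; nltk.download('punkt')

examples = ["John's sick.", "John's house", "data-base vs. data base", "I know C++ :-)"]
for text in examples:
    print(text, "->", word_tokenize(text))

# "John's sick."  -> ['John', "'s", 'sick', '.']  (the clitic 's becomes its own token)
# "John's house"  -> ['John', "'s", 'house']      (same split, different meaning)
```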
Text Preprocessing Level 2: Token
• Power-law distribution of word frequencies
✓ More frequently appearing words (tokens) are not necessarily more important for text mining tasks.
[Figures] The 100 most common words in the Oxford English Corpus (http://en.wikipedia.org/wiki/Most_common_words_in_English); word frequency distribution in Wikipedia (http://upload.wikimedia.org/wikipedia/commons/b/b9/Wikipedia-n-zipf.png)
Text Preprocessing Level 2: Token
• Stop-words
✓ Words that carry little or no information
▪ They mainly play a functional role
▪ They are usually removed to help machine learning algorithms perform better
✓ Stop-word lists are language dependent
▪ English: a, about, above, across, after, again, against, all, also, etc.
▪ Korean: …습니다, …로서(써), …를, etc.
[Figure] Example text before and after stop-word removal
http://eprints.pascal-network.org/archive/00000017/01/Tutorial_Marko.pdf
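A minimal stop-word removal sketch using NLTK's English stop-word list; the example sentence is an arbitrary illustration.

```python
from nltk.corpus import stopwords          # pip install nltk; nltk.download('stopwords')
from nltk.tokenize import word_tokenize

stop_set = set(stopwords.words("english"))  # 'a', 'about', 'above', 'after', ...
tokens = word_tokenize("Information retrieval is the activity of obtaining information "
                       "resources relevant to an information need.")
content_tokens = [t for t in tokens if t.lower() not in stop_set and t.isalpha()]
print(content_tokens)
# -> ['Information', 'retrieval', 'activity', 'obtaining', 'information',
#     'resources', 'relevant', 'information', 'need']
```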
Text Preprocessing Level 2: Token
• Stemming
✓ Different forms of the same word are usually problematic for text data analysis,
because they have different spellings but similar meanings
▪ Learns, learned, learning, …
✓ Stemming is a process of transforming a word into its stem (normalized form)
▪ In English: Porter2 stemmer (http://snowball.tartarus.org/algorithms/english/stemmer.html)
▪ In Korean: the Kkma morphological analyzer (꼬꼬마 형태소 분석기, http://kkma.snu.ac.kr/documents/)
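A short sketch of the Porter2 algorithm via NLTK's SnowballStemmer; the word list is illustrative.

```python
from nltk.stem import SnowballStemmer   # Snowball "english" is the Porter2 algorithm

stemmer = SnowballStemmer("english")
for word in ["learns", "learned", "learning", "innovation", "innovative"]:
    print(word, "->", stemmer.stem(word))

# learns / learned / learning all reduce to the same stem 'learn';
# the stem is not guaranteed to be an actual word of the language
```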
Text Preprocessing Level 2: Token
• Lemmatization
✓ While stemming just finds a base form, which does not even need to be a word
in the language, lemmatization finds the actual root (lemma) of a word
Word Stemming Lemmatization
Love Lov Love
Loves Lov Love
Loved Lov Love
Loving Lov Love
Innovation Innovat Innovation
Innovations Innovat Innovation
Innovate Innovat Innovate
Innovates Innovat Innovate
Innovative Innovat Innovative
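A corresponding lemmatization sketch with NLTK's WordNet lemmatizer; note that it needs the correct part-of-speech tag to recover the lemmas shown in the table.

```python
from nltk.stem import WordNetLemmatizer   # nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
for word in ["loves", "loved", "loving"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))   # 'love' in all three cases

# Without the correct part-of-speech tag the lemmatizer defaults to nouns,
# so in practice it is combined with a POS tagger.
print(lemmatizer.lemmatize("loved"))       # 'loved' (no change when treated as a noun)
```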
AGENDA
01 Text Analytics: Overview
02 TA Process 1: Collection & Preprocessing
03 TA Process 2: Transformation
04 TA Process 3: Dimensionality Reduction
05 TA Process 4: Learning & Evaluation
Text Transformation
• Document representation
✓ Bag-of-words: a simplifying representation in which a document is represented
as a vector of counts over an unordered collection of words
S1: John likes to watch movies. Mary likes too.
S2: John also likes to watch football games.
Word S1 S2
John 1 1
Likes 2 1
To 1 1
Watch 1 1
Movies 1 0
Also 0 1
Football 0 1
Games 0 1
Mary 1 0
too 1 0
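A bag-of-words sketch of the two sentences above using scikit-learn's CountVectorizer (one possible implementation, with its default lowercasing and token pattern).

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["John likes to watch movies. Mary likes too.",
        "John also likes to watch football games."]

vectorizer = CountVectorizer()               # lowercases and tokenizes by default
X = vectorizer.fit_transform(docs)           # 2 x |V| document-term count matrix
print(vectorizer.get_feature_names_out())    # ['also' 'football' 'games' 'john' 'likes' ...]
print(X.toarray())                           # row 0 (S1) has count 2 for 'likes'
```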
Text Transformation
• Word Weighting
✓ Each word is represented as a separate variable with a numeric weight
✓ Term frequency–inverse document frequency (TF-IDF)
▪ tf(w): term frequency (number of occurrences of the word in a document)
▪ df(w): document frequency (number of documents containing the word)
TF-IDF(w) = tf(w) × log( N / df(w) ),  where N is the total number of documents
▪ tf term: a word is more important if it appears several times in the target document
▪ idf term: a word is more important if it appears in fewer documents
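A direct sketch of this formula on a toy three-document corpus; note that library implementations such as scikit-learn's TfidfVectorizer use a smoothed variant of idf, so their values differ slightly.

```python
import math
from collections import Counter

docs = [["john", "likes", "movies", "likes"],
        ["john", "likes", "football"],
        ["mary", "watches", "movies"]]
N = len(docs)
df = Counter(w for doc in docs for w in set(doc))      # df(w): documents containing w

def tf_idf(doc):
    tf = Counter(doc)                                  # tf(w): occurrences in this document
    return {w: tf[w] * math.log(N / df[w]) for w in tf}  # natural log

print(tf_idf(docs[0]))
# 'likes' appears twice and in 2 of 3 documents -> 2 * log(3/2) ≈ 0.81
# 'movies' and 'john' appear once, each in 2 documents -> 1 * log(3/2) ≈ 0.41
# a word occurring in all N documents would get weight log(N/N) = 0
```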
Text Transformation
• Word weighting (cont’d): example
[Term frequency counts tf(w)]
Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth
Antony 157 73 0 0 0 0
Brutus 4 157 0 1 0 0
Caesar 232 227 0 2 1 1
Calpurnia 0 10 0 0 0 0
Cleopatra 57 0 0 0 0 0
mercy 2 0 3 5 5 1
worser 2 0 1 1 1 0
[TF-IDF weights]
Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth
Antony 5.25 3.18 0 0 0 0
Brutus 1.21 6.1 0 1 0 0
Caesar 8.59 2.54 0 1.51 0.25 0.35
Calpurnia 0 1.54 0 0 0 0
Cleopatra 2.85 0 0 0 0 0
mercy 1.51 0 1.9 0.12 5.25 0.88
worser 1.37 0 0.11 4.15 0.25 1.95
Text Transformation
• One-hot-vector representation
✓ The simplest and most intuitive representation
✓ It yields a vector representation, but similarities between words cannot be preserved.
Text Transformation
• Word vectors: distributed representation
✓ A parameterized function mapping words of some language to vectors in a fixed-dimensional space
• Interesting feature of word embedding
✓ Semantic relationship between words can be preserved
Text Transformation
• Word vectors: distributed representation
http://nlp.stanford.edu/projects/glove/
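A small sketch of loading pre-trained GloVe vectors through gensim's downloader (the model name and the roughly 65 MB download on first use are assumptions about the environment) and checking that semantic relationships are preserved.

```python
import gensim.downloader as api

# 50-dimensional GloVe vectors trained on Wikipedia + Gigaword; downloaded on first use
wv = api.load("glove-wiki-gigaword-50")

print(wv.similarity("king", "queen"))            # semantically related words are close
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# the classic analogy: vector('king') - vector('man') + vector('woman') ≈ vector('queen')
```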
Text Transformation
• Pre-trained Word Models
• Pre-trained Language Models
AGENDA
01 Text Analytics: Overview
02 TA Process 1: Collection & Preprocessing
03 TA Process 2: Transformation
04 TA Process 3: Dimensionality Reduction
05 TA Process 4: Learning & Evaluation
Feature Selection/Extraction
• Feature subset selection
✓ Select only the best features for further analysis
▪ The most frequent
▪ The most informative relative to all class values, …
✓ Scoring methods for individual features (for supervised learning tasks; a sketch follows below)
▪ Information gain
▪ Cross-entropy
▪ Mutual information
▪ Weight of evidence
▪ Odds ratio
▪ Frequency
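As a sketch of one of these criteria, the snippet below scores bag-of-words features by mutual information with the class label using scikit-learn; the toy documents and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

docs = ["cheap meds free offer", "free money click now", "meeting agenda attached",
        "project schedule update", "free offer limited time", "budget meeting tomorrow"]
labels = [1, 1, 0, 0, 1, 0]                       # 1 = spam, 0 = ham (toy labels)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)                 # bag-of-words features

# Keep the 5 terms with the highest mutual information with the class label
selector = SelectKBest(mutual_info_classif, k=5).fit(X, labels)
keep = selector.get_support()
print([w for w, k in zip(vectorizer.get_feature_names_out(), keep) if k])
```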
Feature Selection/Extraction
• Feature subset extraction
✓ Feature extraction: construct a set of variables that preserves the information of the
original data by combining the original variables in a linear/non-linear form
✓ Latent Semantic Analysis (LSA), which is based on singular value decomposition (SVD), is commonly used
▪ A rectangular matrix A (m × n) can be decomposed as A = U Σ V^T, where U is m × r, Σ is r × r, and V^T is r × n
▪ The singular vectors in U and V are orthonormal: U^T U = V^T V = I
▪ The singular values on the diagonal of Σ are positive and sorted in descending order
Feature Selection/Extraction Lee (2010)
• SVD in Text Mining (Latent Semantic Analysis/Indexing)
✓ Step 1) Using SVD, a term-document matrix D is reduced to a rank-k approximation: D ≈ Dk = Uk Σk Vk^T
✓ Step 2) Multiply by the transpose of the matrix Uk: Uk^T Dk = Σk Vk^T
✓ Step 3) Apply data mining algorithms to the matrix obtained in Step 2)
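A NumPy sketch of these two steps on a toy term-document matrix; for large sparse matrices, scikit-learn's TruncatedSVD is the usual choice.

```python
import numpy as np

# Toy term-document matrix D (m terms x n documents)
D = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 0., 1.],
              [0., 0., 1., 2.]])

U, s, Vt = np.linalg.svd(D, full_matrices=False)   # D = U diag(s) V^T, s sorted descending
k = 2
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

D_k = U_k @ S_k @ Vt_k          # Step 1: rank-k approximation of D
docs_k = U_k.T @ D_k            # Step 2: Uk^T Dk = Σk Vk^T, k-dimensional document vectors
print(docs_k.shape)             # (2, 4): each column is a document in the latent space
```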
Feature Selection/Extraction
• LSA
https://www.quora.com/How-is-LSA-used-for-a-text-document-clustering
Feature Selection/Extraction
• LSA example
✓ Data: 41,717 abstracts of the research projects that were funded by National Science
Foundation (NSF) between 1990 and 2003
✓ Top 10 positive and negative keywords for each SVD dimension
Feature Selection/Extraction
• LSA example
✓ Visualize the projects in the reduced 2-D space
Feature Selection/Extraction
• Topic Modeling as a Feature Extractor
✓ Latent Dirichlet Allocation (LDA)
• Words (w) are generated from particular topics (z)
• The topic proportion (θ) of a document is determined by a Dirichlet distribution with parameter α
• Only w can actually be observed; θ, z, and Φ are hidden (latent) values
• Document generation process: first draw the topic proportions θ from the Dirichlet distribution, then generate the words w based on θ
Feature Selection/Extraction
• Topic Modeling as a Feature Extractor
✓ Two outputs of topic modeling
▪ Per-document topic proportion
▪ Per-topic word distribution
(a) Per-document topic proportions (θd)
        Topic 1  Topic 2  Topic 3  …  Topic K  Sum
Doc 1   0.20     0.50     0.10     …  0.10     1
Doc 2   0.50     0.02     0.01     …  0.40     1
Doc 3   0.05     0.12     0.48     …  0.15     1
…       …        …        …        …  …        1
Doc N   0.14     0.25     0.33     …  0.14     1

(b) Per-topic word distributions (ϕk)
        Topic 1  Topic 2  Topic 3  …  Topic K
word 1  0.01     0.05     0.05     …  0.10
word 2  0.02     0.02     0.01     …  0.03
word 3  0.05     0.12     0.08     …  0.02
…       …        …        …        …  …
word V  0.04     0.01     0.03     …  0.07
Sum     1        1        1        …  1
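A sketch of obtaining both outputs with scikit-learn's LatentDirichletAllocation on a toy corpus; fit_transform returns the per-document topic proportions, and normalizing components_ row-wise gives the per-topic word distributions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["stock market trading prices", "market economy trading growth",
        "soccer match goal score", "match championship goal team",
        "stock prices fall economy"]

X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)

theta = lda.fit_transform(X)                        # per-document topic proportions (rows sum to 1)
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # per-topic word distributions
print(np.round(theta, 2))                           # N x K matrix, one row per document
print(np.round(phi, 2))                             # K x V matrix, one row per topic
```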
Feature Selection/Extraction
• Document to vector (Doc2Vec)
✓ A natural extension of word2vec
✓ Use a distributed representation for each document
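A minimal Doc2Vec sketch with gensim (4.x API); the corpus and hyperparameters are toy values for illustration.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=doc.lower().split(), tags=[i])
          for i, doc in enumerate(["john likes to watch movies",
                                   "john also likes football games",
                                   "mary enjoys watching movies too"])]

model = Doc2Vec(corpus, vector_size=20, min_count=1, epochs=50)   # tiny toy settings
vec = model.infer_vector("mary likes movies".split())             # embed an unseen document
print(model.dv.most_similar([vec], topn=2))                       # nearest training documents
```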
AGENDA
01 Text Analytics: Overview
02 TA Process 1: Collection & Preprocessing
03 TA Process 2: Transformation
04 TA Process 3: Dimensionality Reduction
05 TA Process 4: Learning & Evaluation
Similarity Between Documents
• Document similarity
✓ Use the cosine similarity rather than Euclidean distance
▪ Which two documents are more similar?
Doc. Word 1 Word 2 Word 3
Document 1 1 1 1
Document 2 3 3 3
Document 3 0 2 0
Sim(D1, D2) = Σi x1i·x2i / ( √(Σj x1j²) · √(Σk x2k²) )
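A quick numeric check of the table above: cosine similarity treats Document 1 and Document 2 as identical (same word proportions), whereas Euclidean distance would place Document 1 closer to Document 3.

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

d1 = np.array([1, 1, 1])     # Document 1
d2 = np.array([3, 3, 3])     # Document 2 (same word proportions, three times longer)
d3 = np.array([0, 2, 0])     # Document 3

print(cosine(d1, d2))        # ≈ 1.0: same direction, maximal cosine similarity
print(cosine(d1, d3))        # ≈ 0.577
print(np.linalg.norm(d1 - d2), np.linalg.norm(d1 - d3))
# ≈ 3.46 and ≈ 1.73: Euclidean distance would call D1 closer to D3,
# even though D1 and D2 use the words in exactly the same proportions
```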
Learning Task 1: Classification
• Document categorization (classification)
✓ Automatically classify a document into one of the pre-defined categories
[Figure] Labeled documents are used to train a machine learning algorithm, which then assigns a category to each unlabeled document
Learning Task 1: Classification
• Spam filtering
[Figure] Raw data → Features → Model
✓ Vector space model (bag of words)
✓ Domain knowledge-based phrases (“Free money”, “!!!”)
✓ Meta-data (sender, mailing list, etc.)
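A hedged sketch of such a spam filter using only the bag-of-words features (TF-IDF plus logistic regression); the mails and labels are invented toy data, and the phrase and meta-data features above are omitted.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

mails = ["Free money!!! Click now", "Win a free prize, limited offer",
         "Meeting moved to 3pm", "Please review the attached report",
         "Cheap meds, free shipping!!!", "Lunch tomorrow?"]
labels = [1, 1, 0, 0, 1, 0]                      # 1 = spam, 0 = ham (toy labels)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(mails, labels)
print(clf.predict(["free offer, click here", "agenda for the review meeting"]))
# predicted labels for two new mails
```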
Learning Task 1: Classification
• Sentiment Analysis
http://www.crowdsource.com/solutions/content-moderation/sentiment-analysis/
Learning Task 1: Classification Socher et al. (2013)
• Sentiment Analysis
✓ Sentiment Treebank @Stanford NLP Lab
Learning Task 2: Clustering
• Document Clustering & Visualization
✓ Get a top-level view of the topics in the corpora
✓ See relationships between topics
✓ Better understand what is going on
✓ Raw data: 8,850 articles from 11 journals over the last 10 years; 21,434 terms after preprocessing
✓ Features: 50 topics from Latent Dirichlet Allocation (LDA)
Learning Task 2: Clustering
• Document Clustering & Visualization
[Figures] Keyword association; journal/topic clustering
Learning Task 2: Clustering
• Document Clustering & Visualization
✓ ThemeScape: content maps from Thomson Innovation full-text patent data
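A small clustering sketch in the same spirit, assuming TF-IDF features and k-means rather than the LDA topic features used in the example above; the documents are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["neural networks for image recognition", "deep learning image classification",
        "stock market risk analysis", "portfolio risk and market returns",
        "convolutional networks for vision"]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)        # cluster assignment per document, e.g. vision vs. finance themes
```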
Learning Task 3: Information Extraction/Retrieval
Yang et al. (2013)
• Information extraction/retrieval
✓ Find useful information from text databases
✓ Examples: Question Answering
https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/Super_Bowl_50.html?model=r-net+%20(ensemble)%20(Microsoft%20Research%20Asia)&version=1.1
Learning Task 3: Information Extraction/Retrieval
• Topic Modeling
✓ A suite of algorithms that aim to discover and annotate large archives of documents with thematic information
✓ Statistical methods that analyze the words of the original texts to discover
▪ the themes that run through them
▪ how themes are connected to each other
▪ how they change over time
https://dhs.stanford.edu/algorithmic-literacy/my-definition-of-topic-modeling/
Learning Task 3: Information Extraction/Retrieval
• Latent Dirichlet Allocation (LDA)
• Words (w) are statistically generated from the topics (z)
• The topic proportion of a document (θd) is determined by a Dirichlet distribution with parameter α
• We can only observe w; θ, z, and Φ are latent variables (hidden, cannot be observed)
• Document generation process: (1) draw the topic proportions θ from the Dirichlet prior, (2) for each word, draw a topic z from θ and then a word w from that topic’s word distribution Φ (a generative sketch follows below)
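A NumPy sketch of this generative process, assuming a toy vocabulary and two hand-specified topic-word distributions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["game", "team", "score", "market", "stock", "price"]    # toy vocabulary
phi = np.array([[0.40, 0.35, 0.20, 0.02, 0.02, 0.01],            # topic 1: sports
                [0.02, 0.01, 0.02, 0.35, 0.35, 0.25]])           # topic 2: finance
alpha = [0.5, 0.5]                                               # Dirichlet parameter

theta = rng.dirichlet(alpha)                   # topic proportions for one document
words = []
for _ in range(8):                             # generate an 8-word document
    z = rng.choice(2, p=theta)                 # draw a topic from theta
    words.append(rng.choice(vocab, p=phi[z]))  # draw a word from that topic's distribution
print(theta.round(2), words)
```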