Lecture 1: Introduction to Text Analytics
Pilsung Kang
School of Industrial Management Engineering
Korea University
AGENDA
01 Text Analytics: Overview
02 TA Process 1: Collection & Preprocessing
03 TA Process 2: Transformation
04 TA Process 3: Dimensionality Reduction
05 TA Process 4: Learning & Evaluation
Decide What to Mine Witte (2006)
• From a wide range of text sources…
Decide What to Mine
• Open datasets/repositories
✓ The best 25 datasets for NLP (2018.06.07)
▪ https://gengo.ai/datasets/the-best-25-datasets-for-natural-language-processing/
✓ Alphabetical list of free/public domain datasets with text data for use in Natural
Language Processing (NLP)
▪ https://github.com/niderhoff/nlp-datasets
✓ 50 Free Machine Learning Datasets: Natural Language Processing
▪ https://blog.cambridgespark.com/50-free-machine-learning-datasets-natural-language-processing-d88fb9c5c8da
✓ 25 Open Datasets for Deep Learning Every Data Scientist Must Work With
▪ https://www.analyticsvidhya.com/blog/2018/03/comprehensive-collection-deep-learning-datasets/
Text Preprocessing Level 0: Text
• Remove unnecessary information from the collected data
• Remove figures, advertisements, HTML syntax, hyperlinks, etc.
Text Preprocessing Level 0: Text
• Do not remove meta-data, which contains significant information about the text
✓ Ex) Newspaper article: author, date, category, language, newspaper, etc.
• Meta-data can be used for further analysis
✓ Target class of a document
✓ Time series analysis
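A minimal Level-0 cleaning sketch, assuming the collected documents are raw HTML; it uses only the Python standard library, and the article fields and regular expressions are illustrative rather than a prescribed recipe. Note that the meta-data fields are kept alongside the cleaned body, as recommended above.

```python
import re
from html import unescape

def clean_html(raw: str) -> str:
    """Level-0 cleaning sketch: strip tags, scripts, and hyperlinks, keep the text."""
    text = re.sub(r"(?is)<(script|style).*?>.*?</\1>", " ", raw)  # drop scripts/styles
    text = re.sub(r"(?s)<[^>]+>", " ", text)                      # drop remaining HTML tags
    text = re.sub(r"https?://\S+", " ", text)                     # drop bare hyperlinks
    text = unescape(text)                                         # decode HTML entities
    return re.sub(r"\s+", " ", text).strip()                      # normalize whitespace

# Keep meta-data (author, date, category, ...) alongside the cleaned body
article = {"author": "J. Doe", "date": "2018-06-07", "category": "politics",
           "body": clean_html("<p>See <a href='http://x.y'>this</a> story.</p>")}
print(article["body"])   # -> "See this story."
```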
Text Preprocessing Level 1: Sentence
• Correct sentence boundary detection is also important
✓ For many downstream analysis tasks
▪ POS taggers maximize the probabilities of tags within a sentence
▪ Summarization systems rely on correct detection of sentence boundaries
Text Preprocessing Level 1: Sentence
• Sentence Splitting
Text Preprocessing Level 1: Sentence
• Sentence Splitting with Rule-based Model
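A toy rule-based splitter along these lines is sketched below; the regular expression and the abbreviation list are assumptions for illustration, and production splitters (e.g., NLTK's Punkt models) use much richer rules and statistics.

```python
import re

# Common abbreviations that should not end a sentence (illustrative, not exhaustive)
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc.", "a.m.", "p.m."}

def split_sentences(text: str) -> list[str]:
    """Rule: split on ., !, ? followed by whitespace and an uppercase letter,
    unless the token in front of the punctuation is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        last_token = text[start:match.end()].split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue                                  # abbreviation: not a boundary
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    sentences.append(text[start:].strip())
    return [s for s in sentences if s]

print(split_sentences("Dr. Smith gave a talk. It went well! Mr. Lee asked questions."))
# -> ['Dr. Smith gave a talk.', 'It went well!', 'Mr. Lee asked questions.']
```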
Text Preprocessing Level 2: Token
• Extracting meaningful tokens (words, numbers, spaces, etc.) worth analyzing from a text is not an easy task.
✓ Is “John’s sick” one token or two?
▪ If one → problems in parsing (where is the verb?)
▪ If two → what do we do with “John’s house”?
✓ What to do with hyphens?
▪ database vs. data-base vs. data base
✓ What to do with “C++”, “A/C”, “:-)”, “…”, “ㅋㅋㅋㅋㅋㅋㅋㅋ”?
✓ Some languages do not use whitespace (e.g., Chinese)
• Consistent tokenization is important for all later processing steps.
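The sketch below shows one common convention, NLTK's Penn Treebank-style word_tokenize (it requires the 'punkt' tokenizer models); other tokenizers resolve the cases above differently.

```python
from nltk.tokenize import word_tokenize   # pip install nltk; nltk.download('punkt')

examples = ["John's sick.", "John's house", "data-base vs. data base", "I know C++ :-)"]
for text in examples:
    print(text, "->", word_tokenize(text))

# "John's sick."  -> ['John', "'s", 'sick', '.']  (the clitic 's becomes its own token)
# "John's house"  -> ['John', "'s", 'house']      (same split, different meaning)
```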
Text Preprocessing Level 2: Token
• Power-law distribution of word frequencies
✓ More frequently appearing words (tokens) are not necessarily more important for text mining tasks.
[Figures] The 100 most common words in the Oxford English Corpus (http://en.wikipedia.org/wiki/Most_common_words_in_English); word frequency distribution in Wikipedia (http://upload.wikimedia.org/wikipedia/commons/b/b9/Wikipedia-n-zipf.png)
Text Preprocessing Level 2: Token
• Stop-words
✓ Words that carry little or no information
▪ They mainly play a functional role
▪ They are usually removed to help machine learning algorithms perform better
✓ Stop-word lists are language dependent
▪ English: a, about, above, across, after, again, against, all, also, etc.
▪ Korean: …습니다, …로서(써), …를, etc.
[Figure] Example text before and after stop-word removal
http://eprints.pascal-network.org/archive/00000017/01/Tutorial_Marko.pdf
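A minimal stop-word removal sketch using NLTK's English stop-word list; the example sentence is an arbitrary illustration.

```python
from nltk.corpus import stopwords          # pip install nltk; nltk.download('stopwords')
from nltk.tokenize import word_tokenize

stop_set = set(stopwords.words("english"))  # 'a', 'about', 'above', 'after', ...
tokens = word_tokenize("Information retrieval is the activity of obtaining information "
                       "resources relevant to an information need.")
content_tokens = [t for t in tokens if t.lower() not in stop_set and t.isalpha()]
print(content_tokens)
# -> ['Information', 'retrieval', 'activity', 'obtaining', 'information',
#     'resources', 'relevant', 'information', 'need']
```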
Text Preprocessing Level 2: Token
• Stemming
✓ Different forms of the same word are usually problematic for text data analysis,
because they have different spellings but similar meanings
▪ Learns, learned, learning, …
✓ Stemming is a process of transforming a word into its stem (normalized form)
▪ In English: Porter2 stemmer (http://snowball.tartarus.org/algorithms/english/stemmer.html)
▪ In Korean: the Kkma morphological analyzer (꼬꼬마 형태소 분석기, http://kkma.snu.ac.kr/documents/)
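A short sketch of the Porter2 algorithm via NLTK's SnowballStemmer; the word list is illustrative.

```python
from nltk.stem import SnowballStemmer   # Snowball "english" is the Porter2 algorithm

stemmer = SnowballStemmer("english")
for word in ["learns", "learned", "learning", "innovation", "innovative"]:
    print(word, "->", stemmer.stem(word))

# learns / learned / learning all reduce to the same stem 'learn';
# the stem is not guaranteed to be an actual word of the language
```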
Text Preprocessing Level 2: Token
• Lemmatization
✓ While stemming just finds a base form, which does not even need to be a word
in the language, lemmatization finds the actual root (lemma) of a word
Word Stemming Lemmatization
Love Lov Love
Loves Lov Love
Loved Lov Love
Loving Lov Love
Innovation Innovat Innovation
Innovations Innovat Innovation
Innovate Innovat Innovate
Innovates Innovat Innovate
Innovative Innovat Innovative
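A corresponding lemmatization sketch with NLTK's WordNet lemmatizer; note that it needs the correct part-of-speech tag to recover the lemmas shown in the table.

```python
from nltk.stem import WordNetLemmatizer   # nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
for word in ["loves", "loved", "loving"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))   # 'love' in all three cases

# Without the correct part-of-speech tag the lemmatizer defaults to nouns,
# so in practice it is combined with a POS tagger.
print(lemmatizer.lemmatize("loved"))       # 'loved' (no change when treated as a noun)
```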
AGENDA
01 Text Analytics: Overview
02 TA Process 1: Collection & Preprocessing
03 TA Process 2: Transformation
04 TA Process 3: Dimensionality Reduction
05 TA Process 4: Learning & Evaluation
Text Transformation
• Document representation
✓ Bag-of-words: a simplifying representation in which a document is represented
as a vector of counts over an unordered collection of words
S1: John likes to watch movies. Mary likes too.
S2: John also likes to watch football games.
Word S1 S2
John 1 1
Likes 2 1
To 1 1
Watch 1 1
Movies 1 0
Also 0 1
Football 0 1
Games 0 1
Mary 1 0
too 1 0
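A bag-of-words sketch of the two sentences above using scikit-learn's CountVectorizer (one possible implementation, with its default lowercasing and token pattern).

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["John likes to watch movies. Mary likes too.",
        "John also likes to watch football games."]

vectorizer = CountVectorizer()               # lowercases and tokenizes by default
X = vectorizer.fit_transform(docs)           # 2 x |V| document-term count matrix
print(vectorizer.get_feature_names_out())    # ['also' 'football' 'games' 'john' 'likes' ...]
print(X.toarray())                           # row 0 (S1) has count 2 for 'likes'
```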
Text Transformation
• Word Weighting
✓ Each word is represented as a separate variable with a numeric weight
✓ Term frequency–inverse document frequency (TF-IDF)
▪ tf(w): term frequency (number of occurrences of the word in a document)
▪ df(w): document frequency (number of documents containing the word)
TF-IDF(w) = tf(w) × log( N / df(w) ),  where N is the total number of documents
▪ tf term: a word is more important if it appears several times in the target document
▪ idf term: a word is more important if it appears in fewer documents
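A direct sketch of this formula on a toy three-document corpus; note that library implementations such as scikit-learn's TfidfVectorizer use a smoothed variant of idf, so their values differ slightly.

```python
import math
from collections import Counter

docs = [["john", "likes", "movies", "likes"],
        ["john", "likes", "football"],
        ["mary", "watches", "movies"]]
N = len(docs)
df = Counter(w for doc in docs for w in set(doc))      # df(w): documents containing w

def tf_idf(doc):
    tf = Counter(doc)                                  # tf(w): occurrences in this document
    return {w: tf[w] * math.log(N / df[w]) for w in tf}  # natural log

print(tf_idf(docs[0]))
# 'likes' appears twice and in 2 of 3 documents -> 2 * log(3/2) ≈ 0.81
# 'movies' and 'john' appear once, each in 2 documents -> 1 * log(3/2) ≈ 0.41
# a word occurring in all N documents would get weight log(N/N) = 0
```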
Text Transformation
• Word weighting (cont’d): example
[Term frequency counts tf(w)]
Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth
Antony 157 73 0 0 0 0
Brutus 4 157 0 1 0 0
Caesar 232 227 0 2 1 1
Calpurnia 0 10 0 0 0 0
Cleopatra 57 0 0 0 0 0
mercy 2 0 3 5 5 1
worser 2 0 1 1 1 0
[TF-IDF weights]
Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth
Antony 5.25 3.18 0 0 0 0
Brutus 1.21 6.1 0 1 0 0
Caesar 8.59 2.54 0 1.51 0.25 0.35
Calpurnia 0 1.54 0 0 0 0
Cleopatra 2.85 0 0 0 0 0
mercy 1.51 0 1.9 0.12 5.25 0.88
worser 1.37 0 0.11 4.15 0.25 1.95
Text Transformation
• One-hot-vector representation
✓ The simplest and most intuitive representation
✓ It yields a vector representation, but similarities between words cannot be preserved.
Text Transformation
• Word vectors: distributed representation
✓ A parameterized function mapping words of some language to vectors in a fixed-dimensional space
• Interesting feature of word embedding
✓ Semantic relationship between words can be preserved
Text Transformation
• Word vectors: distributed representation
http://nlp.stanford.edu/projects/glove/
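A small sketch of loading pre-trained GloVe vectors through gensim's downloader (the model name and the roughly 65 MB download on first use are assumptions about the environment) and checking that semantic relationships are preserved.

```python
import gensim.downloader as api

# 50-dimensional GloVe vectors trained on Wikipedia + Gigaword; downloaded on first use
wv = api.load("glove-wiki-gigaword-50")

print(wv.similarity("king", "queen"))            # semantically related words are close
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# the classic analogy: vector('king') - vector('man') + vector('woman') ≈ vector('queen')
```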
Text Transformation
• Pre-trained Word Models
• Pre-trained Language Models
AGENDA
01 Text Analytics: Overview
02 TA Process 1: Collection & Preprocessing
03 TA Process 2: Transformation
04 TA Process 3: Dimensionality Reduction
05 TA Process 4: Learning & Evaluation
Feature Selection/Extraction
• Feature subset selection
✓ Select only the best features for further analysis
▪ The most frequent
▪ The most informative relative to all class values, …
✓ Scoring methods for individual features (for supervised learning tasks; a sketch follows below)
▪ Information gain
▪ Cross-entropy
▪ Mutual information
▪ Weight of evidence
▪ Odds ratio
▪ Frequency
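As a sketch of one of these criteria, the snippet below scores bag-of-words features by mutual information with the class label using scikit-learn; the toy documents and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

docs = ["cheap meds free offer", "free money click now", "meeting agenda attached",
        "project schedule update", "free offer limited time", "budget meeting tomorrow"]
labels = [1, 1, 0, 0, 1, 0]                       # 1 = spam, 0 = ham (toy labels)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)                 # bag-of-words features

# Keep the 5 terms with the highest mutual information with the class label
selector = SelectKBest(mutual_info_classif, k=5).fit(X, labels)
keep = selector.get_support()
print([w for w, k in zip(vectorizer.get_feature_names_out(), keep) if k])
```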
Feature Selection/Extraction
• Feature subset extraction
✓ Feature extraction: construct a set of variables that preserves the information of the
original data by combining the original variables in a linear/non-linear form
✓ Latent Semantic Analysis (LSA), which is based on singular value decomposition (SVD), is commonly used
▪ A rectangular matrix A (m × n) can be decomposed as A = U Σ V^T, where U is m × r, Σ is r × r, and V^T is r × n
▪ The singular vectors in U and V are orthonormal: U^T U = V^T V = I
▪ The singular values on the diagonal of Σ are positive and sorted in descending order
Feature Selection/Extraction Lee (2010)
• SVD in Text Mining (Latent Semantic Analysis/Indexing)
✓ Step 1) Using SVD, a term-document matrix D is reduced to a rank-k approximation: D ≈ Dk = Uk Σk Vk^T
✓ Step 2) Multiply by the transpose of the matrix Uk: Uk^T Dk = Σk Vk^T
✓ Step 3) Apply data mining algorithms to the matrix obtained in Step 2)
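A NumPy sketch of these two steps on a toy term-document matrix; for large sparse matrices, scikit-learn's TruncatedSVD is the usual choice.

```python
import numpy as np

# Toy term-document matrix D (m terms x n documents)
D = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 0., 1.],
              [0., 0., 1., 2.]])

U, s, Vt = np.linalg.svd(D, full_matrices=False)   # D = U diag(s) V^T, s sorted descending
k = 2
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

D_k = U_k @ S_k @ Vt_k          # Step 1: rank-k approximation of D
docs_k = U_k.T @ D_k            # Step 2: Uk^T Dk = Σk Vk^T, k-dimensional document vectors
print(docs_k.shape)             # (2, 4): each column is a document in the latent space
```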
Feature Selection/Extraction
• LSA
https://www.quora.com/How-is-LSA-used-for-a-text-document-clustering
Feature Selection/Extraction
• LSA example
✓ Data: 41,717 abstracts of the research projects that were funded by National Science
Foundation (NSF) between 1990 and 2003
✓ Top 10 positive and negative keywords for each SVD dimension
Feature Selection/Extraction
• LSA example
✓ Visualize the projects in the reduced 2-D space
Feature Selection/Extraction
• Topic Modeling as a Feature Extractor
✓ Latent Dirichlet Allocation (LDA)
• Words (w) are generated from particular topics (z)
• The topic proportion (θ) of a document is determined by a Dirichlet distribution with parameter α
• Only w can actually be observed; θ, z, and Φ are hidden (latent) values
• Document generation process: first draw the topic proportions θ from the Dirichlet distribution, then generate the words w based on θ
Feature Selection/Extraction
• Topic Modeling as a Feature Extractor
✓ Two outputs of topic modeling
▪ Per-document topic proportion
▪ Per-topic word distribution
(a) Per-document topic proportions (θd)
        Topic 1  Topic 2  Topic 3  …  Topic K  Sum
Doc 1   0.20     0.50     0.10     …  0.10     1
Doc 2   0.50     0.02     0.01     …  0.40     1
Doc 3   0.05     0.12     0.48     …  0.15     1
…       …        …        …        …  …        1
Doc N   0.14     0.25     0.33     …  0.14     1

(b) Per-topic word distributions (ϕk)
        Topic 1  Topic 2  Topic 3  …  Topic K
word 1  0.01     0.05     0.05     …  0.10
word 2  0.02     0.02     0.01     …  0.03
word 3  0.05     0.12     0.08     …  0.02
…       …        …        …        …  …
word V  0.04     0.01     0.03     …  0.07
Sum     1        1        1        …  1
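A sketch of obtaining both outputs with scikit-learn's LatentDirichletAllocation on a toy corpus; fit_transform returns the per-document topic proportions, and normalizing components_ row-wise gives the per-topic word distributions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["stock market trading prices", "market economy trading growth",
        "soccer match goal score", "match championship goal team",
        "stock prices fall economy"]

X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)

theta = lda.fit_transform(X)                        # per-document topic proportions (rows sum to 1)
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # per-topic word distributions
print(np.round(theta, 2))                           # N x K matrix, one row per document
print(np.round(phi, 2))                             # K x V matrix, one row per topic
```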
Feature Selection/Extraction
• Document to vector (Doc2Vec)
✓ A natural extension of word2vec
✓ Use a distributed representation for each document
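A minimal Doc2Vec sketch with gensim (4.x API); the corpus and hyperparameters are toy values for illustration.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=doc.lower().split(), tags=[i])
          for i, doc in enumerate(["john likes to watch movies",
                                   "john also likes football games",
                                   "mary enjoys watching movies too"])]

model = Doc2Vec(corpus, vector_size=20, min_count=1, epochs=50)   # tiny toy settings
vec = model.infer_vector("mary likes movies".split())             # embed an unseen document
print(model.dv.most_similar([vec], topn=2))                       # nearest training documents
```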
AGENDA
01 Text Analytics: Overview
02 TA Process 1: Collection & Preprocessing
03 TA Process 2: Transformation
04 TA Process 3: Dimensionality Reduction
05 TA Process 4: Learning & Evaluation
Similarity Between Documents
• Document similarity
✓ Use the cosine similarity rather than Euclidean distance
▪ Which two documents are more similar?
Doc. Word 1 Word 2 Word 3
Document 1 1 1 1
Document 2 3 3 3
Document 3 0 2 0
Sim(D1, D2) = Σi x1i·x2i / ( √(Σj x1j²) · √(Σk x2k²) )
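A quick numeric check of the table above: cosine similarity treats Document 1 and Document 2 as identical (same word proportions), whereas Euclidean distance would place Document 1 closer to Document 3.

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

d1 = np.array([1, 1, 1])     # Document 1
d2 = np.array([3, 3, 3])     # Document 2 (same word proportions, three times longer)
d3 = np.array([0, 2, 0])     # Document 3

print(cosine(d1, d2))        # ≈ 1.0: same direction, maximal cosine similarity
print(cosine(d1, d3))        # ≈ 0.577
print(np.linalg.norm(d1 - d2), np.linalg.norm(d1 - d3))
# ≈ 3.46 and ≈ 1.73: Euclidean distance would call D1 closer to D3,
# even though D1 and D2 use the words in exactly the same proportions
```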
Learning Task 1: Classification
• Document categorization (classification)
✓ Automatically classify a document into one of the pre-defined categories
[Figure] Labeled documents are used to train a machine learning algorithm, which then assigns a category to each unlabeled document
Learning Task 1: Classification
• Spam filtering
[Figure] Raw data → Features → Model
✓ Vector space model (bag of words)
✓ Domain knowledge-based phrases (“Free money”, “!!!”)
✓ Meta-data (sender, mailing list, etc.)
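A hedged sketch of such a spam filter using only the bag-of-words features (TF-IDF plus logistic regression); the mails and labels are invented toy data, and the phrase and meta-data features above are omitted.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

mails = ["Free money!!! Click now", "Win a free prize, limited offer",
         "Meeting moved to 3pm", "Please review the attached report",
         "Cheap meds, free shipping!!!", "Lunch tomorrow?"]
labels = [1, 1, 0, 0, 1, 0]                      # 1 = spam, 0 = ham (toy labels)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(mails, labels)
print(clf.predict(["free offer, click here", "agenda for the review meeting"]))
# predicted labels for two new mails
```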
Learning Task 1: Classification
• Sentiment Analysis
http://www.crowdsource.com/solutions/content-moderation/sentiment-analysis/
Learning Task 1: Classification Socher et al. (2013)
• Sentiment Analysis
✓ Sentiment Treebank @Stanford NLP Lab
Learning Task 2: Clustering
• Document Clustering & Visualization
✓ Get a top-level view of the topics in the corpora
✓ See relationships between topics
✓ Better understand what is going on
✓ Raw data: 8,850 articles from 11 journals over the last 10 years; 21,434 terms after preprocessing
✓ Features: 50 topics from Latent Dirichlet Allocation (LDA)
Learning Task 2: Clustering
• Document Clustering & Visualization
[Figures] Keyword association; journal/topic clustering
Learning Task 2: Clustering
• Document Clustering & Visualization
✓ ThemeScape: content maps from Thomson Innovation full-text patent data
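A small clustering sketch in the same spirit, assuming TF-IDF features and k-means rather than the LDA topic features used in the example above; the documents are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["neural networks for image recognition", "deep learning image classification",
        "stock market risk analysis", "portfolio risk and market returns",
        "convolutional networks for vision"]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)        # cluster assignment per document, e.g. vision vs. finance themes
```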
Learning Task 3: Information Extraction/Retrieval
Yang et al. (2013)
• Information extraction/retrieval
✓ Find useful information from text databases
✓ Examples: Question Answering
https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/Super_Bowl_50.html?model=r-net+%20(ensemble)%20(Microsoft%20Research%20Asia)&version=1.1
Learning Task 3: Information Extraction/Retrieval
• Topic Modeling
✓ A suite of algorithms that aim to discover and annotate large archives of documents with thematic information
✓ Statistical methods that analyze the words of the original texts to discover
▪ the themes that run through them
▪ how themes are connected to each other
▪ how they change over time
https://dhs.stanford.edu/algorithmic-literacy/my-definition-of-topic-modeling/
Learning Task 3: Information Extraction/Retrieval
• Latent Dirichlet Allocation (LDA)
• Words (w) are statistically generated from the topics (z)
• The topic proportion of a document (θd) is determined by a Dirichlet distribution with parameter α
• We can only observe w; θ, z, and Φ are latent variables (hidden, cannot be observed)
• Document generation process: (1) draw the topic proportions θ from the Dirichlet prior, (2) for each word, draw a topic z from θ and then a word w from that topic’s word distribution Φ (a generative sketch follows below)
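A NumPy sketch of this generative process, assuming a toy vocabulary and two hand-specified topic-word distributions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["game", "team", "score", "market", "stock", "price"]    # toy vocabulary
phi = np.array([[0.40, 0.35, 0.20, 0.02, 0.02, 0.01],            # topic 1: sports
                [0.02, 0.01, 0.02, 0.35, 0.35, 0.25]])           # topic 2: finance
alpha = [0.5, 0.5]                                               # Dirichlet parameter

theta = rng.dirichlet(alpha)                   # topic proportions for one document
words = []
for _ in range(8):                             # generate an 8-word document
    z = rng.choice(2, p=theta)                 # draw a topic from theta
    words.append(rng.choice(vocab, p=phi[z]))  # draw a word from that topic's distribution
print(theta.round(2), words)
```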