Lecture 2: Text Preprocessing
Pilsung Kang
School of Industrial Management Engineering
Korea University
AGENDA
01 Introduction to NLP
02 Lexical Analysis
03 Syntax Analysis
04 Other Topics in NLP
Lexical Analysis
• Goals of lexical analysis
✓ Convert a sequence of characters into a sequence of tokens, i.e., meaningful
character strings.
▪ In natural language processing, the morpheme is the basic unit
▪ In text mining, the word is commonly used as the basic unit of analysis
• Process of lexical analysis
✓ Tokenizing
✓ Part-of-Speech (POS) tagging
✓ Additional analysis: named entity recognition (NER), noun phrase recognition,
sentence splitting, chunking, etc.
Lexical Analysis Hirschberg and Manning (2015)
• Examples of Linguistic Structure Analysis
Lexical Analysis 1: Sentence Splitting Witte (2016)
• Sentence splitting is very important in NLP, but it is not critical for some text mining tasks
Lexical Analysis 2: Tokenization
• Text is split into basic units called Tokens
✓ word tokens, number tokens, space tokens, …
                     MC            Scan
Space                Not removed   Removed
Punctuation          Removed       Not removed
Numbers              Removed       Not removed
Special characters   Removed       Not removed
Lexical Analysis 2: Tokenization
• Even tokenization can be difficult
✓ Is John’s sick one token or two?
▪ If one → problems in parsing (where is the verb?)
▪ If two → what do we do with John’s house?
✓ What to do with hyphens?
▪ database vs. data-base vs. data base
✓ What to do with “C++”, “A/C”, “:-)”, “…”, “ㅋㅋㅋㅋㅋㅋㅋㅋ”?
✓ Some languages do not use whitespace (e.g., Chinese)
• Consistent tokenization is important for all later processing steps.
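The tokenization choices above (punctuation, hyphens, clitics such as "John's") can be sketched with a small regex tokenizer. This is an illustration, not a production tokenizer; the pattern is an assumption chosen to keep word-internal hyphens and apostrophes together.

```python
import re

def simple_tokenize(text):
    # Keep word-internal hyphens/apostrophes inside the token ("John's",
    # "data-base"); split any other punctuation into its own token.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

print(simple_tokenize("John's sick."))            # ["John's", 'sick', '.']
print(simple_tokenize("data-base vs. data base"))  # ['data-base', 'vs', '.', 'data', 'base']
```

Changing the pattern (e.g., splitting on hyphens) produces a different token stream for the whole pipeline, which is why consistency matters.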
Lexical Analysis 3: Morphological Analysis Witte (2016)
• Morphological Variants: Stemming and Lemmatization
Lexical Analysis 3: Morphological Analysis Witte (2016)
• Stemming
Lexical Analysis 3: Morphological Analysis Witte (2016)
• Lemmatization
Lexical Analysis 3: Morphological Analysis
• Stemming vs. Lemmatization
Word          Stemming   Lemmatization
Love          Lov        Love
Loves         Lov        Love
Loved         Lov        Love
Loving        Lov        Love
Innovation    Innovat    Innovation
Innovations   Innovat    Innovation
Innovate      Innovat    Innovate
Innovates     Innovat    Innovate
Innovative    Innovat    Innovative
Lexical Analysis 3: Morphological Analysis
• Stemming vs. Lemmatization with a crude example
Stemming Lemmatization
Lexical Analysis 4: Part-of-Speech (POS) Tagging
Witte (2016)
• Part of speech (POS) tagging
✓ Given a sentence X, predict its part of speech sequence Y
▪ Input: the tokens of a sentence (which may be ambiguous)
▪ Output: the most appropriate tag for each token, determined from its definition and context (relationships with
adjacent and related words in the phrase, sentence, or paragraph)
✓ A type of “structured” prediction
• Different POS tags for the same token
✓ I love you. → “love” is a verb
✓ All you need is love. → “love” is a noun
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• POS Tagging
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Tagsets: English
Penn Treebank
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Tagsets: Korean
Lexical Analysis 4: Part-of-Speech (POS) Tagging
Witte (2016)
• POS Tagging Algorithms
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• POS Tagging Algorithms
✓ Pointwise prediction: predict each word individually with a classifier (e.g. Maximum
Entropy Model, SVM)
✓ Probabilistic models
▪ Generative sequence models: Find the most probable tag sequence given the sentence
(Hidden Markov Model; HMM)
▪ Discriminative sequence models: Predict whole sequence with a classifier (Conditional
Random Field; CRF)
✓ Neural network-based models
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Pointwise Prediction: Maximum Entropy Model
✓ Encode features for tag prediction
▪ Information about word/context: suffix, prefix, neighborhood word information
▪ e.g., fi(wj, tj) = 1 if suffix(wj) = “ing” and tj = VBG, 0 otherwise
✓ Tagging Model
▪ fi is a feature
▪ λi is a weight (large value implies informative features)
▪ Z(C) is a normalization constant ensuring a proper probability distribution
▪ Makes no independence assumption about the features
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Pointwise Prediction: Maximum Entropy Model
✓ An example
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Pointwise Prediction: Maximum Entropy Model
✓ An example
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Probabilistic Model for POS Tagging
✓ Find the most probable tag sequence given the sentence
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Generative Sequence Model
✓ Decompose the probability using Bayes’ rule
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Generative Sequence Model: Hidden Markov Model
✓ POS → POS transition probabilities
✓ POS → Word emission probabilities
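Given the transition and emission probabilities above, the most probable tag sequence can be found with Viterbi decoding. The sketch below is a minimal illustration; the tiny probability tables are made-up numbers, not estimates from a corpus.

```python
def viterbi(words, tags, trans, emit, start):
    """Most probable tag sequence for `words` under an HMM.

    trans[(t_prev, t)] = P(t | t_prev); emit[(t, w)] = P(w | t);
    start[t] = P(t | <s>). Missing entries count as probability 0.
    """
    # delta[t] = score of the best tag path ending in tag t
    delta = {t: start.get(t, 0.0) * emit.get((t, words[0]), 0.0) for t in tags}
    back = []  # backpointers, one dict per position after the first
    for w in words[1:]:
        new, ptr = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: delta[p] * trans.get((p, t), 0.0))
            new[t] = delta[best_prev] * trans.get((best_prev, t), 0.0) * emit.get((t, w), 0.0)
            ptr[t] = best_prev
        delta, back = new, back + [ptr]
    # Trace back from the best final tag
    seq = [max(tags, key=lambda t: delta[t])]
    for ptr in reversed(back):
        seq.append(ptr[seq[-1]])
    return list(reversed(seq))

tags = ["PRP", "VB", "NN"]
start = {"PRP": 0.8, "NN": 0.2}
trans = {("PRP", "VB"): 0.9, ("VB", "PRP"): 0.5, ("VB", "NN"): 0.5}
emit = {("PRP", "i"): 0.5, ("PRP", "you"): 0.5,
        ("VB", "love"): 0.3, ("NN", "love"): 0.1}

print(viterbi(["i", "love", "you"], tags, trans, emit, start))
# ['PRP', 'VB', 'PRP']  -- "love" is tagged VB because of the PRP -> VB transition
```

In practice the products are computed in log space to avoid underflow; plain multiplication is kept here for readability.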
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Discriminative Sequence Model: Conditional Random Field (CRF)
✓ Relaxes the constraint that a tag is generated by the previous tag sequence
✓ Predicts the whole tag sequence at once rather than sequentially
http://people.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf
Lexical Analysis 4: Part-of-Speech (POS) Tagging
Collobert et al. (2011)
• Neural Network-based Models
✓ Window-based vs. sentence-based
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Neural network-based models
✓ Recurrent neural networks: have a feedback loop within the hidden layer
✓ Input-Output mapping of RNNs
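The feedback loop mentioned above can be shown with a bare-bones recurrent cell in plain Python: the hidden state at step t depends on the input at step t and on the hidden state at step t-1. The scalar weights are fixed toy values for illustration.

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """h_t = tanh(w_x * x_t + w_h * h_{t-1} + b), scalar case for clarity."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def rnn_forward(xs):
    h = 0.0  # initial hidden state
    states = []
    for x in xs:  # this loop over time steps is the feedback connection
        h = rnn_step(x, h)
        states.append(h)
    return states

states = rnn_forward([1.0, 0.0, 1.0])
print(states)  # the state at t=1 is nonzero even though x_1 = 0,
               # because h carries information forward from t=0
```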
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Neural network-based models: Recurrent neural networks
Lexical Analysis 4: Part-of-Speech (POS) Tagging
Ma and Hovy (2016)
• Hybrid model: LSTM(RNN) + ConvNet + CRF
Lexical Analysis 5: Named Entity Recognition
• Named Entity Recognition: NER
✓ a subtask of information extraction that seeks to locate and classify elements in text
into pre-defined categories such as the names of persons, organizations, locations,
expressions of times, quantities, monetary values, percentages, etc.
http://eric-yuan.me/ner_1/
Lexical Analysis 5: Named Entity Recognition
Approaches for NER: Dictionary/Rule-based
• List lookup: systems that recognize only entities stored in their lists
✓ Advantages: simple, fast, language-independent, easy to retarget
✓ Disadvantages: lists are costly to collect and maintain, cannot deal with name variants, and
cannot resolve ambiguity
• Shallow Parsing Approach
✓ Internal evidence – names often have internal structure. These components can be
either stored or guessed.
▪ Location: Cap Word + {Street, Boulevard, Avenue, Crescent, Road}
▪ e.g.: Wall Street
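The shallow-parsing rule above (capitalized word + location keyword) translates directly into a regular expression. The keyword list mirrors the slide; everything else is an illustrative sketch, not a full NER system.

```python
import re

LOC_KEYWORDS = r"(?:Street|Boulevard|Avenue|Crescent|Road)"
# A capitalized word immediately followed by a location keyword
LOC_PATTERN = re.compile(r"\b([A-Z][a-z]+ " + LOC_KEYWORDS + r")\b")

def find_locations(text):
    """Return spans matched by the internal-evidence rule for locations."""
    return LOC_PATTERN.findall(text)

print(find_locations("He works on Wall Street near Fifth Avenue."))
# ['Wall Street', 'Fifth Avenue']
```

Such rules are precise but brittle: "Wall Street" as a metonym for the financial industry is still tagged LOCATION, which is the ambiguity problem noted above.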
Lexical Analysis 5: Named Entity Recognition
Approaches for NER: Model-based
• MITIE
✓ An open-source information extraction tool developed by the MIT NLP lab
✓ Available for English and Spanish
✓ Available for C++, Java, R, and Python
• CRF++
✓ NER based on conditional random fields
✓ Supports multi-language models
• Convolutional neural networks
✓ 1-of-M coding, Word2Vec, N-Grams can be used as encoding methods
BERT for Multi NLP Tasks
• Google Transformer
✓ Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017).
Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).
✓ Excellent blog post explaining the Transformer
▪ http://jalammar.github.io/illustrated-transformer/
BERT for Multi NLP Tasks
• BERT
✓ Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language
understanding. arXiv preprint arXiv:1810.04805.