Lecture 2: Text Preprocessing
Pilsung Kang
School of Industrial Management Engineering
Korea University
AGENDA
01 Introduction to NLP
02 Lexical Analysis
03 Syntax Analysis
04 Other Topics in NLP
Lexical Analysis
• Goals of lexical analysis
✓ Convert a sequence of characters into a sequence of tokens, i.e., meaningful
character strings.
▪ In natural language processing, the morpheme is the basic unit
▪ In text mining, the word is commonly used as the basic unit of analysis
• Process of lexical analysis
✓ Tokenizing
✓ Part-of-Speech (POS) tagging
✓ Additional analysis: named entity recognition (NER), noun phrase recognition,
sentence splitting, chunking, etc.
Lexical Analysis Hirschberg and Manning (2015)
• Examples of Linguistic Structure Analysis
Lexical Analysis 1: Sentence Splitting Witte (2016)
• Sentence splitting is very important in NLP, but it is not critical for some text mining tasks
Lexical Analysis 2: Tokenization
• Text is split into basic units called Tokens
✓ word tokens, number tokens, space tokens, …
                     MC            Scan
Space                Not removed   Removed
Punctuation          Removed       Not removed
Numbers              Removed       Not removed
Special characters   Removed       Not removed
Lexical Analysis 2: Tokenization
• Even tokenization can be difficult
✓ Is John’s sick one token or two?
▪ If one → problems in parsing (where is the verb?)
▪ If two → what do we do with John’s house?
✓ What to do with hyphens?
▪ database vs. data-base vs. data base
✓ What to do with “C++”, “A/C”, “:-)”, “…”, “ㅋㅋㅋㅋㅋㅋㅋㅋ”?
✓ Some languages do not use whitespace (e.g., Chinese)
• Consistent tokenization is important for all later processing steps.
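The tokenization choices above (punctuation, hyphens, clitics such as "John's") can be sketched with a small regex tokenizer. This is an illustration, not a production tokenizer; the pattern is an assumption chosen to keep word-internal hyphens and apostrophes together.

```python
import re

def simple_tokenize(text):
    # Keep word-internal hyphens/apostrophes inside the token ("John's",
    # "data-base"); split any other punctuation into its own token.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

print(simple_tokenize("John's sick."))            # ["John's", 'sick', '.']
print(simple_tokenize("data-base vs. data base"))  # ['data-base', 'vs', '.', 'data', 'base']
```

Changing the pattern (e.g., splitting on hyphens) produces a different token stream for the whole pipeline, which is why consistency matters.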
Lexical Analysis 3: Morphological Analysis Witte (2016)
• Morphological Variants: Stemming and Lemmatization
Lexical Analysis 3: Morphological Analysis Witte (2016)
• Stemming
Lexical Analysis 3: Morphological Analysis Witte (2016)
• Lemmatization
Lexical Analysis 3: Morphological Analysis
• Stemming vs. Lemmatization
Word          Stemming   Lemmatization
Love          Lov        Love
Loves         Lov        Love
Loved         Lov        Love
Loving        Lov        Love
Innovation    Innovat    Innovation
Innovations   Innovat    Innovation
Innovate      Innovat    Innovate
Innovates     Innovat    Innovate
Innovative    Innovat    Innovative
Lexical Analysis 3: Morphological Analysis
• Stemming vs. Lemmatization with a crude example
Stemming Lemmatization
Lexical Analysis 4: Part-of-Speech (POS) Tagging
Witte (2016)
• Part of speech (POS) tagging
✓ Given a sentence X, predict its part of speech sequence Y
▪ Input: the tokens of a sentence (which may be ambiguous)
▪ Output: the most appropriate tag for each token, determined from its definition and context (relationships with
adjacent and related words in the phrase, sentence, or paragraph)
✓ A type of “structured” prediction
• Different POS tags for the same token
✓ I love you. → “love” is a verb
✓ All you need is love. → “love” is a noun
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• POS Tagging
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Tagsets: English
Penn Treebank
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Tagsets: Korean
Lexical Analysis 4: Part-of-Speech (POS) Tagging
Witte (2016)
• POS Tagging Algorithms
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• POS Tagging Algorithms
✓ Pointwise prediction: predict each word individually with a classifier (e.g. Maximum
Entropy Model, SVM)
✓ Probabilistic models
▪ Generative sequence models: Find the most probable tag sequence given the sentence
(Hidden Markov Model; HMM)
▪ Discriminative sequence models: Predict whole sequence with a classifier (Conditional
Random Field; CRF)
✓ Neural network-based models
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Pointwise Prediction: Maximum Entropy Model
✓ Encode features for tag prediction
▪ Information about word/context: suffix, prefix, neighborhood word information
▪ e.g., fi(wj, tj) = 1 if suffix(wj) = “ing” and tj = VBG, 0 otherwise
✓ Tagging Model
▪ fi is a feature
▪ λi is a weight (large value implies informative features)
▪ Z(C) is a normalization constant ensuring a proper probability distribution
▪ Makes no independence assumption about the features
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Pointwise Prediction: Maximum Entropy Model
✓ An example
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Pointwise Prediction: Maximum Entropy Model
✓ An example
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Probabilistic Model for POS Tagging
✓ Find the most probable tag sequence given the sentence
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Generative Sequence Model
✓ Decompose the probability using Bayes’ rule
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Generative Sequence Model: Hidden Markov Model
✓ POS → POS transition probabilities
✓ POS → Word emission probabilities
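Given the transition and emission probabilities above, the most probable tag sequence can be found with Viterbi decoding. The sketch below is a minimal illustration; the tiny probability tables are made-up numbers, not estimates from a corpus.

```python
def viterbi(words, tags, trans, emit, start):
    """Most probable tag sequence for `words` under an HMM.

    trans[(t_prev, t)] = P(t | t_prev); emit[(t, w)] = P(w | t);
    start[t] = P(t | <s>). Missing entries count as probability 0.
    """
    # delta[t] = score of the best tag path ending in tag t
    delta = {t: start.get(t, 0.0) * emit.get((t, words[0]), 0.0) for t in tags}
    back = []  # backpointers, one dict per position after the first
    for w in words[1:]:
        new, ptr = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: delta[p] * trans.get((p, t), 0.0))
            new[t] = delta[best_prev] * trans.get((best_prev, t), 0.0) * emit.get((t, w), 0.0)
            ptr[t] = best_prev
        delta, back = new, back + [ptr]
    # Trace back from the best final tag
    seq = [max(tags, key=lambda t: delta[t])]
    for ptr in reversed(back):
        seq.append(ptr[seq[-1]])
    return list(reversed(seq))

tags = ["PRP", "VB", "NN"]
start = {"PRP": 0.8, "NN": 0.2}
trans = {("PRP", "VB"): 0.9, ("VB", "PRP"): 0.5, ("VB", "NN"): 0.5}
emit = {("PRP", "i"): 0.5, ("PRP", "you"): 0.5,
        ("VB", "love"): 0.3, ("NN", "love"): 0.1}

print(viterbi(["i", "love", "you"], tags, trans, emit, start))
# ['PRP', 'VB', 'PRP']  -- "love" is tagged VB because of the PRP -> VB transition
```

In practice the products are computed in log space to avoid underflow; plain multiplication is kept here for readability.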
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Discriminative Sequence Model: Conditional Random Field (CRF)
✓ Relaxes the constraint that a tag is generated by the previous tag sequence
✓ Predicts the whole tag sequence at once rather than sequentially
http://people.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf
Lexical Analysis 4: Part-of-Speech (POS) Tagging
Collobert et al. (2011)
• Neural Network-based Models
✓ Window-based vs. sentence-based
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Neural network-based models
✓ Recurrent neural networks: have a feedback loop within the hidden layer
✓ Input-Output mapping of RNNs
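The feedback loop mentioned above can be shown with a bare-bones recurrent cell in plain Python: the hidden state at step t depends on the input at step t and on the hidden state at step t-1. The scalar weights are fixed toy values for illustration.

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """h_t = tanh(w_x * x_t + w_h * h_{t-1} + b), scalar case for clarity."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def rnn_forward(xs):
    h = 0.0  # initial hidden state
    states = []
    for x in xs:  # this loop over time steps is the feedback connection
        h = rnn_step(x, h)
        states.append(h)
    return states

states = rnn_forward([1.0, 0.0, 1.0])
print(states)  # the state at t=1 is nonzero even though x_1 = 0,
               # because h carries information forward from t=0
```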
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Neural network-based models: Recurrent neural networks
Lexical Analysis 4: Part-of-Speech (POS) Tagging
Ma and Hovy (2016)
• Hybrid model: LSTM(RNN) + ConvNet + CRF
Lexical Analysis 5: Named Entity Recognition
• Named Entity Recognition: NER
✓ a subtask of information extraction that seeks to locate and classify elements in text
into pre-defined categories such as the names of persons, organizations, locations,
expressions of times, quantities, monetary values, percentages, etc.
http://eric-yuan.me/ner_1/
Lexical Analysis 5: Named Entity Recognition
Approaches for NER: Dictionary/Rule-based
• List lookup: systems that recognize only entities stored in their lists
✓ Advantages: simple, fast, language-independent, easy to retarget
✓ Disadvantages: lists are costly to collect and maintain, cannot deal with name variants, and
cannot resolve ambiguity
• Shallow Parsing Approach
✓ Internal evidence – names often have internal structure. These components can be
either stored or guessed.
▪ Location: Cap Word + {Street, Boulevard, Avenue, Crescent, Road}
▪ e.g.: Wall Street
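The shallow-parsing rule above (capitalized word + location keyword) translates directly into a regular expression. The keyword list mirrors the slide; everything else is an illustrative sketch, not a full NER system.

```python
import re

LOC_KEYWORDS = r"(?:Street|Boulevard|Avenue|Crescent|Road)"
# A capitalized word immediately followed by a location keyword
LOC_PATTERN = re.compile(r"\b([A-Z][a-z]+ " + LOC_KEYWORDS + r")\b")

def find_locations(text):
    """Return spans matched by the internal-evidence rule for locations."""
    return LOC_PATTERN.findall(text)

print(find_locations("He works on Wall Street near Fifth Avenue."))
# ['Wall Street', 'Fifth Avenue']
```

Such rules are precise but brittle: "Wall Street" as a metonym for the financial industry is still tagged LOCATION, which is the ambiguity problem noted above.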
Lexical Analysis 5: Named Entity Recognition
Approaches for NER: Model-based
• MITIE
✓ An open-source information extraction tool developed by the MIT NLP lab
✓ Available for English and Spanish
✓ Available for C++, Java, R, and Python
• CRF++
✓ NER based on conditional random fields
✓ Supports multi-language models
• Convolutional neural networks
✓ 1-of-M coding, Word2Vec, N-Grams can be used as encoding methods
BERT for Multi NLP Tasks
• Google Transformer
✓ Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017).
Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).
✓ Excellent blog post explaining the Transformer
▪ http://jalammar.github.io/illustrated-transformer/
BERT for Multi NLP Tasks
• BERT
✓ Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language
understanding. arXiv preprint arXiv:1810.04805.