WEEK 2
Natural Language Processing
CSC 4106
MUHAMMAD ATIF SAEED
LECTURER (Artificial Intelligence & Robotics)
Course Outline
• Natural Language Processing: Toolkits and Concepts
• Toolkits
• Natural Language Toolkit (NLTK), Apache OpenNLP, Stanford
CoreNLP, Unstructured Information Management Architecture
(UIMA)
Natural Language Processing (NLP) | MUHAMMAD ATIF SAEED (Lecturer)
Implementation of NLP Application
Step 1:
- Analyze the Task - Define the Framework
Step 2:
- Preprocess Data - Inspect and get Insights
Step 3:
- Define Relevant Information - Extract Information
Step 4:
- Select Appropriate Algorithm - Implement the Algorithm
Step 5:
- Apply your Algorithm in Practice - Test and Evaluate
Step 1: Analysis of the task
Define exactly what the task involves: e.g., ask how you would solve it
yourself (without ML)
• In spam filtering: you probably pay attention to certain characteristics
(sender, fonts, format, how many recipients the email has, etc.)
• You also may pay attention to the content: “lottery”, “click on this
link”, “your account is blocked”, and similar
• Most probably, you classify the emails into two types: normal emails
and spam
• ⇒ Binary classification task
Step 2: Analysis and preprocessing
• Given the “red flags” (words and phrases), you may attempt to use
templates
• For machine learning, define what the relevant data is and how to
prepare it:
• You need access to labelled data of two classes
• What is the distribution of classes?
• Are you going to use only textual features?
• Are there any other significant differences (e.g., spam emails being
considerably shorter)?
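The template idea and the class-distribution question can be sketched in a few lines of Python. This is a minimal illustration, not a production filter: the red-flag list reuses the phrases from Step 1, and the five emails, the labels "spam"/"normal", and the helper names `is_spam_by_template` and `class_distribution` are all assumptions made up for this example.

```python
import re
from collections import Counter

# Red-flag phrases taken from the Step 1 examples (an illustrative list only).
RED_FLAGS = ["lottery", "click on this link", "your account is blocked"]
PATTERN = re.compile("|".join(re.escape(p) for p in RED_FLAGS), re.IGNORECASE)


def is_spam_by_template(text):
    """Return True if any red-flag phrase occurs in the text."""
    return PATTERN.search(text) is not None


def class_distribution(labels):
    """Return the share of each class label in the labelled data."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


# Toy labelled data, invented for illustration.
emails = [
    ("You won the LOTTERY, click on this link now", "spam"),
    ("Meeting moved to 3pm tomorrow", "normal"),
    ("Your account is blocked, verify immediately", "spam"),
    ("Lecture notes for week 2 attached", "normal"),
    ("Lunch on Friday?", "normal"),
]

predictions = [is_spam_by_template(text) for text, _ in emails]
distribution = class_distribution([label for _, label in emails])
print(predictions)
print(distribution)
```

A template like this needs no training data, but it only catches the phrases someone thought to list, which is one reason to move on to labelled data and machine learning.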
Step 3: Definition and extraction of the relevant information
• Identify relevant signal in the data
• Is it single words (“lottery”, “blocked”) or phrases (“click on this link”)?
• Are you going to learn from misspellings?
• Are you going to learn from different ways to spell words (e.g., “Now”, “now”,
“NOW”)?
• Are you going to learn from word occurrences or word distribution?
• Will you apply any other normalisation techniques?
• The above points refer to feature selection, feature representation,
and feature weighting
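As a rough sketch of these choices: lowercasing is one possible normalisation, word counts are one possible feature representation, and TF-IDF is one common feature weighting. The helper names (`tokenize`, `term_counts`, `tf_idf`) and the three example texts are assumptions for illustration, and the plain log(N / document frequency) weighting is only one of several TF-IDF variants.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens, so "Now", "now", "NOW" merge."""
    return re.findall(r"[a-z0-9']+", text.lower())


def term_counts(text):
    """Feature representation: how often each word occurs in one text."""
    return Counter(tokenize(text))


def tf_idf(docs):
    """Feature weighting: count * log(N / number of docs containing the term)."""
    counts = [term_counts(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts for term in c)
    return [
        {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in c.items()}
        for c in counts
    ]


# Toy texts, invented for illustration.
docs = [
    "Click NOW to claim your lottery prize",
    "Are you free now for a call",
    "Your lottery ticket: click now, NOW, now",
]

counts = term_counts(docs[2])
weights = tf_idf(docs)
print(counts["now"])          # occurrences of "now" after normalisation
print(weights[2]["now"])      # weight of a word that appears in every text
print(weights[2]["ticket"])   # weight of a word that appears in one text only
```

Note how the weighting changes the picture: "now" is the most frequent word in the third text, yet it gets zero weight because it appears in every text and so carries no signal for telling them apart.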
Step 4: Implementation of the algorithm
• No algorithm can be considered absolutely the best for all tasks
and all datasets
• Analyse the task to identify which one suits best in each
particular case
Step 5: Testing and evaluation
• It is important to understand how your current algorithm performs and
what you can do better
• E.g., for classification tasks, you can measure accuracy, precision, recall, F1
• Arguably, it is better to let some annoying spam messages slip through
than to send important “normal” emails to the spam box. Is precision or
recall more important?
• It is advisable to set up a baseline: How frequent is the majority
class? How would the simplest algorithm perform? Are you really
doing better using a more sophisticated approach?
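The measures named above can be computed directly from the true and predicted labels. The sketch below assumes that spam is treated as the positive class; under that assumption, a normal email sent to the spam box is a false positive, which lowers precision. The label lists are invented for illustration, and the majority-class baseline simply predicts the most frequent label for every email.

```python
from collections import Counter


def precision_recall_f1(y_true, y_pred, positive="spam"):
    """Precision, recall and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_true):
    """Simplest possible classifier: always predict the most frequent class."""
    majority = Counter(y_true).most_common(1)[0][0]
    return [majority] * len(y_true)


# Toy labels, invented for illustration: 3 spam and 7 normal emails.
y_true = ["spam"] * 3 + ["normal"] * 7
y_pred = ["spam", "spam", "normal", "spam"] + ["normal"] * 6

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
baseline_pred = majority_baseline(y_true)
print("model   :", accuracy(y_true, y_pred), precision, recall, f1)
print("baseline:", accuracy(y_true, baseline_pred),
      precision_recall_f1(y_true, baseline_pred))
```

On this toy data the baseline already reaches 70% accuracy without catching a single spam email, which is why accuracy alone can be misleading when the classes are imbalanced.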
Natural Language Toolkit (NLTK)
Natural Language Toolkit (NLTK) is a leading platform for building
Python programs to work with human language data. It provides
easy-to-use interfaces to over 50 corpora and lexical resources
such as WordNet, along with a suite of text processing libraries for
classification, tokenization, stemming, tagging, parsing, and
semantic reasoning, as well as wrappers for industrial-strength NLP
libraries.
spaCy
spaCy is a library for advanced Natural Language Processing in
Python and Cython. It's built on the very latest research, and was
designed from day one to be used in real products. spaCy comes
with pretrained pipelines and currently supports tokenization and
training for 60+ languages. It also features neural network models
for tagging, parsing, named entity recognition, text classification,
and more, as well as multi-task learning with pretrained transformers
like BERT.
CoreNLP
CoreNLP is a set of natural language analysis tools written in Java.
CoreNLP enables users to derive linguistic annotations for text,
including token and sentence boundaries, parts of speech, named
entities, numeric and time values, dependency and constituency
parses, coreference, sentiment, quote attributions, and relations.
NLPnet
NLPnet is a Python library for Natural Language Processing tasks
based on neural networks. It performs part-of-speech tagging,
semantic role labeling and dependency parsing.
Flair
Flair is a simple framework that lets you apply state-of-the-art Natural
Language Processing (NLP) models to your text, such as named entity
recognition (NER), part-of-speech tagging (PoS), sense disambiguation,
and classification. It offers special support for biomedical data and
supports a rapidly growing number of languages.
Catalyst
Catalyst is a C# Natural Language Processing library built for speed.
Inspired by spaCy's design, it brings pre-trained models,
out-of-the-box support for training word and document embeddings, and
flexible entity recognition models.
Apache OpenNLP
Apache OpenNLP is an open-source, machine-learning-based toolkit
for processing natural language text. It features an API for use cases
such as Named Entity Recognition, Sentence Detection, POS
(Part-Of-Speech) tagging, Tokenization, Feature Extraction, Chunking,
Parsing, and Coreference Resolution.
DyNet
DyNet is a neural network library developed by Carnegie Mellon
University and many others. It is written in C++ (with bindings in
Python) and is designed to be efficient when run on either CPU or
GPU, and to work well with networks that have dynamic structures
that change for every training instance. These kinds of networks
are particularly important in natural language processing tasks,
and DyNet has been used to build state-of-the-art systems for
syntactic parsing, machine translation, morphological inflection,
and many other application areas.
MLpack
MLpack is a fast, flexible machine learning library written in
C++ and built on the Armadillo linear algebra library, the
ensmallen numerical optimization library, and parts of Boost.
OpenNN
OpenNN is an open-source neural networks library for machine
learning. It contains sophisticated algorithms and utilities to deal
with many artificial intelligence solutions.
Microsoft Cognitive Toolkit (CNTK)
Microsoft Cognitive Toolkit (CNTK) is an open-source toolkit for
commercial-grade distributed deep learning. It describes neural
networks as a series of computational steps via a directed graph.
CNTK allows the user to easily realize and combine popular model
types such as feed-forward DNNs, convolutional neural networks
(CNNs) and recurrent neural networks (RNNs/LSTMs). CNTK
implements stochastic gradient descent (SGD, error
backpropagation) learning with automatic differentiation and
parallelization across multiple GPUs and servers.
NVIDIA cuDNN
NVIDIA cuDNN is a GPU-accelerated library of primitives for deep
neural networks. cuDNN provides highly tuned implementations
for standard routines such as forward and backward convolution,
pooling, normalization, and activation layers. cuDNN accelerates
widely used deep learning frameworks, including Caffe2, Chainer,
Keras, MATLAB, MXNet, PyTorch, and TensorFlow.
TensorFlow
TensorFlow is an end-to-end open source platform for machine
learning. It has a comprehensive, flexible ecosystem of tools,
libraries and community resources that lets researchers push the
state of the art in ML and developers easily build and deploy
ML-powered applications.
Keras
Keras is a high-level neural networks API, written in Python and
developed with a focus on enabling fast experimentation. It is
capable of running on top of TensorFlow, Microsoft Cognitive
Toolkit (CNTK), Theano, or PlaidML.
PyTorch
PyTorch is an open-source deep learning library that provides tensor
computation with GPU acceleration and automatic differentiation. It
was primarily developed by Facebook's AI Research lab. Deep learning
on irregular input data such as graphs, point clouds, and manifolds is
the focus of its extension library, PyTorch Geometric.
Scikit-Learn
Scikit-Learn is a Python module for machine learning built on top
of SciPy, NumPy, and matplotlib, making it easier to apply robust
and simple implementations of many popular machine learning
algorithms.
Theano
Theano is a Python library that allows you to define, optimize, and
evaluate mathematical expressions involving multi-dimensional
arrays efficiently, with tight integration with NumPy.
Apache Spark
Apache Spark is a unified analytics engine for large-scale data
processing. It provides high-level APIs in Scala, Java, Python, and R,
and an optimized engine that supports general computation
graphs for data analysis. It also supports a rich set of higher-level
tools including Spark SQL for SQL and DataFrames, MLlib for
machine learning, GraphX for graph processing, and Structured
Streaming for stream processing.
Apache Spark Connector
Apache Spark Connector for SQL Server and Azure SQL is a high-
performance connector that enables you to use transactional data
in big data analytics and persists results for ad-hoc queries or
reporting. The connector allows you to use any SQL database, on-
premises or in the cloud, as an input data source or output data
sink for Spark jobs.
Unstructured Information Management
Unstructured Information Management applications are software
systems that analyze large volumes of unstructured information in
order to discover knowledge that is relevant to an end user. An
example UIM application might ingest plain text and identify
entities, such as persons, places, organizations; or relations, such
as works-for or located-at.
UIMA enables applications to be decomposed into components, for example "language
identification" => "language specific segmentation" => "sentence boundary detection" => "entity
detection (person/place names etc.)".
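The decomposition into components can be illustrated with a small Python pipeline. This is only a conceptual sketch and does not use the UIMA API (UIMA itself is a Java framework): every function, the `GAZETTEER` lookup table, and the dictionary that carries the annotations between components are hypothetical stand-ins for real analysis components.

```python
import re

# Hypothetical lookup table standing in for a real entity detector.
GAZETTEER = {"Alice": "person", "Lahore": "place", "Acme": "organization"}


def identify_language(doc):
    """Toy language identification: look for a few common English words."""
    english_hints = {"the", "is", "in", "and", "for", "works"}
    words = set(re.findall(r"[a-z]+", doc["text"].lower()))
    doc["language"] = "en" if words & english_hints else "unknown"
    return doc


def segment_tokens(doc):
    """Language-specific segmentation (here: a simple word/punctuation split)."""
    doc["tokens"] = re.findall(r"\w+|[^\w\s]", doc["text"])
    return doc


def detect_sentences(doc):
    """Sentence boundary detection: split after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", doc["text"].strip())
    doc["sentences"] = [p for p in parts if p]
    return doc


def detect_entities(doc):
    """Entity detection by gazetteer lookup on the tokens."""
    doc["entities"] = [(t, GAZETTEER[t]) for t in doc["tokens"] if t in GAZETTEER]
    return doc


# Each component reads the shared document and adds its own annotations.
PIPELINE = [identify_language, segment_tokens, detect_sentences, detect_entities]


def run_pipeline(text, components=PIPELINE):
    doc = {"text": text}
    for component in components:
        doc = component(doc)
    return doc


result = run_pipeline("Alice works for Acme. She is located in Lahore.")
print(result["language"])
print(result["sentences"])
print(result["entities"])
```

Because each step only depends on the annotations produced before it, a component can be swapped out (for example, a better entity detector) without touching the rest of the pipeline.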
Fuzzy Logic
Fuzzy logic is a heuristic approach that allows for more advanced
decision-tree processing and better integration with rules-based
programming.
Fuzzy Logic (Cont.)
Fuzzy logic is an extension of classical Boolean logic that permits
the representation of intermediate values between completely
true and completely false. Instead of strict "0" or "1" values, fuzzy
logic allows degrees of truth ranging from 0 to 1.
In classical logic: "The weather is hot" might be True or False.
In fuzzy logic: "The weather is hot" could have a truth value of 0.7,
indicating it's somewhat hot but not extremely hot.
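A degree of truth is usually produced by a membership function. The sketch below assumes a simple linear ramp for "hot" between 20 °C and 40 °C (both limits are arbitrary choices for illustration) and uses the common min/max/complement definitions of fuzzy AND, OR and NOT; other definitions exist.

```python
def hot_membership(temp_c, cold_limit=20.0, hot_limit=40.0):
    """Degree (0 to 1) to which a temperature counts as "hot".

    Assumed shape: 0 below cold_limit, 1 above hot_limit, linear in between.
    """
    if temp_c <= cold_limit:
        return 0.0
    if temp_c >= hot_limit:
        return 1.0
    return (temp_c - cold_limit) / (hot_limit - cold_limit)


def fuzzy_and(a, b):
    """Fuzzy AND as the minimum of the two degrees."""
    return min(a, b)


def fuzzy_or(a, b):
    """Fuzzy OR as the maximum of the two degrees."""
    return max(a, b)


def fuzzy_not(a):
    """Fuzzy NOT as the complement of the degree."""
    return 1.0 - a


hot = hot_membership(34.0)   # 34 °C lies 14/20 of the way up the ramp
humid = 0.4                  # assumed degree for "the weather is humid"
print(hot)
print(fuzzy_and(hot, humid), fuzzy_or(hot, humid), fuzzy_not(hot))
```

With values restricted to exactly 0 and 1, these three operators behave like classical Boolean AND, OR and NOT, which is why fuzzy logic is described as an extension of Boolean logic.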
Fuzzy Logic (Cont.)
Text Classification:
Traditional text classification models assign texts to categories in a
binary manner, but documents often belong to multiple topics to
varying degrees.
Fuzzy classification allows assigning a document to multiple
categories with varying degrees of membership. For example, an
article about "climate change" could belong to both
"Environment" (0.8) and "Politics" (0.4).
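One very simple way to obtain such membership degrees is to measure how many of a category's keywords occur in the document. The keyword lists, the example article, and the function name `fuzzy_memberships` below are assumptions for illustration; real systems typically learn the degrees from data rather than from hand-written lists.

```python
import re

# Hypothetical keyword lists, five per category.
CATEGORY_KEYWORDS = {
    "Environment": {"climate", "emissions", "warming", "carbon", "temperature"},
    "Politics": {"government", "policy", "election", "parliament", "treaty"},
    "Sports": {"match", "team", "goal", "league", "score"},
}


def fuzzy_memberships(text):
    """Degree of membership per category: share of its keywords found in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    degrees = {}
    for category, keywords in CATEGORY_KEYWORDS.items():
        degree = len(words & keywords) / len(keywords)
        if degree > 0:
            degrees[category] = degree
    return degrees


article = (
    "Climate change is driven by carbon emissions; warming oceans push "
    "the government to rethink energy policy."
)
memberships = fuzzy_memberships(article)
print(memberships)
```

Unlike class probabilities, the degrees do not have to sum to 1: a document can belong strongly to several categories at once, or weakly to all of them.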
Support Vector Machine (SVM)
Support Vector Machine (SVM) is a supervised machine learning model
that uses classification algorithms for two-group classification
problems.
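To make the two-group setting concrete, here is a minimal sketch of a linear SVM trained by per-example subgradient steps on the regularised hinge loss. This is one simple training method chosen for illustration, not how library implementations necessarily work. The two features (for example, the number of red-flag words and the number of links in an email), the toy data, and the hyperparameter values are all assumptions; labels must be +1 or -1.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=1000):
    """Fit weights w and bias b by subgradient steps on
    lam/2 * ||w||^2 + max(0, 1 - y * (w.x + b)) for each example."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Example is misclassified or inside the margin: hinge loss is active.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Only the regulariser contributes to the subgradient.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy, linearly separable data (invented): +1 = spam, -1 = normal.
X = [[4, 3], [5, 2], [3, 4], [0, 0], [1, 0], [0, 1]]
y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
print(predict(w, b, [4, 4]), predict(w, b, [0, 0]))
```

The hinge loss only penalises examples that fall on the wrong side of the boundary or too close to it, which is what pushes the model towards a wide margin between the two groups.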
Random Forest
Random forest is a commonly used machine learning algorithm
that combines the output of multiple decision trees to reach a
single result. Each tree is trained on a random (bootstrap) sample of
the data and is usually left unpruned; the forest then predicts by
majority vote (classification) or by averaging (regression). Its ease of
use and flexibility have fueled its adoption, as it handles both
classification and regression problems.
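The idea of combining many trees can be sketched with the smallest possible tree, a one-split "decision stump". This is a simplification for illustration: a real random forest grows deeper trees. Each stump below is trained on a bootstrap sample and a random subset of features, and the forest predicts by majority vote. The toy data, the function names, and the fixed random seed are assumptions.

```python
import random
from collections import Counter


def train_stump(X, y, features):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in features:
        values = sorted({row[f] for row in X})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [label for row, label in zip(X, y) if row[f] <= threshold]
            right = [label for row, label in zip(X, y) if row[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(l != left_label for l in left)
                      + sum(r != right_label for r in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:
        # No split possible (all sampled values identical): predict the majority.
        majority = Counter(y).most_common(1)[0][0]
        return (features[0], float("inf"), majority, majority)
    return best[1:]


def train_forest(X, y, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a feature subset."""
    rng = random.Random(seed)
    n_rows, n_features = len(X), len(X[0])
    n_sub = max(1, int(n_features ** 0.5))
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n_rows) for _ in range(n_rows)]  # sample with replacement
        features = rng.sample(range(n_features), n_sub)        # random feature subset
        forest.append(train_stump([X[i] for i in rows], [y[i] for i in rows], features))
    return forest


def predict_forest(forest, row):
    """Majority vote over the predictions of all stumps."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]


# Toy data (invented): [red-flag words, links] per email.
X = [[8, 6], [9, 7], [10, 8], [9, 6], [0, 0], [1, 1], [2, 0], [1, 0]]
y = ["spam"] * 4 + ["normal"] * 4

forest = train_forest(X, y)
print(predict_forest(forest, [9, 7]), predict_forest(forest, [1, 0]))
```

For regression, the same structure applies, with each tree predicting a number and the forest returning the average instead of the majority vote.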