NLP basics
Only 21% of data is stored in structured form. The rest is generated continuously
and is mostly unstructured, much of it in text form.
What is NLP?
Natural language processing (NLP)
helps computers understand and work with human language
applications include summarization, language translation, autocomplete,
search, chatbots, voice assistants, and many more
What are Corpus, Tokens, and N-grams?
Corpus
a collection of text documents. Documents are made up of paragraphs, paragraphs
are made up of sentences, and sentences are made up of smaller units called tokens
(typically words)
N-grams
are defined as groups of n consecutive words. For example, consider the
sentence:
“I love my phone.”
In this sentence, the unigrams (n=1) are: I, love, my, phone
Bigrams (n=2, also called di-grams) are: I love, love my, my phone
And trigrams (n=3) are: I love my, love my phone
So a unigram is a single word, a bigram is two consecutive words, and a trigram
is three consecutive words.
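The n-gram lists above can be reproduced with a sliding window over the tokens. This is a minimal sketch: the helper name ngrams is made up for illustration, and the token list is written by hand with the final period dropped.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Tokens of the example sentence, written by hand (final period dropped).
tokens = ["I", "love", "my", "phone"]

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A sentence with k tokens yields k - n + 1 n-grams, so asking for n larger than k returns an empty list.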
Tokenization
process of splitting a text into smaller units, which are called tokens
1) Whitespace tokenization
also known as unigram tokenization
For example, take the sentence “I went to New-York to play football.”
This will be split into the following tokens: “I”, “went”, “to”, “New-York”, “to”, “play”,
“football.”
Notice that “New-York” is not split further because the tokenization process
is based on whitespace only.
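In Python, whitespace tokenization can be sketched with the built-in str.split, which splits on runs of whitespace when called without arguments. The variable names here are only illustrative.

```python
sentence = "I went to New-York to play football."

# str.split() with no arguments splits on any run of whitespace.
tokens = sentence.split()

print(tokens)
```

Because only whitespace is used as a separator, the hyphen stays inside "New-York" and the final period stays attached to "football.".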
2) Regular Expression Tokenization
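The notes give no details for this method, so the following is only a sketch of the general idea: a regular expression pattern decides what counts as a token. The pattern \w+ (runs of letters, digits, and underscores) is an assumption chosen for illustration; other patterns give different tokens.

```python
import re

sentence = "I went to New-York to play football."

# Assumed pattern: a token is any run of word characters.
# Hyphens and periods are not word characters, so they act as separators.
regex_tokens = re.findall(r"\w+", sentence)

print(regex_tokens)
```

With this pattern, "New-York" is split into "New" and "York" and the final period is dropped, unlike whitespace tokenization.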
Normalization
Morpheme: the smallest meaningful unit of a word; in these notes it refers to
the base form of a word
tokens are made up of two main components: the morpheme, which is the
base word, and the inflectional form, which is a prefix or suffix attached to
the morpheme
Normalization is the process of converting a token into its base form
Types of normalization
Stemming
rule-based process for removing inflectional forms from tokens; the
output is the stem of the word
stemming is often not preferred because it can produce stems that are not
dictionary words; for example, a simple rule that strips "-ing" turns winning into winn
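A rule-based stemmer can be sketched as a list of suffixes that are stripped in order. The suffix list and the minimum stem length below are hypothetical choices for illustration, not the rules of any established stemmer such as Porter's.

```python
# Hypothetical rule list, checked in order; the first match wins.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for word in ["winning", "played", "studies", "sing"]:
    print(word, "->", naive_stem(word))
```

The rules know nothing about vocabulary, which is why "winning" becomes "winn" and "studies" becomes "studi", neither of which is a dictionary word.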
Lemmatization
systematic, step-by-step process for removing the inflectional forms of a word
it makes use of vocabulary, word structure, part of speech tags, and grammar
the output of lemmatization is the root word, called a lemma
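To show why lemmatization needs vocabulary and part of speech tags, here is a toy lookup-based sketch. The mini-vocabulary, the tag names, and the function lemmatize are all hypothetical; a real lemmatizer relies on a full dictionary and morphological analysis.

```python
# Hypothetical mini-vocabulary: (word, part of speech) -> lemma.
LEMMAS = {
    ("winning", "VERB"): "win",
    ("won", "VERB"): "win",
    ("better", "ADJ"): "good",
    ("mice", "NOUN"): "mouse",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
}

def lemmatize(word, pos):
    """Look up the lemma; fall back to the lowercased word if unknown."""
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())

print(lemmatize("winning", "VERB"))
print(lemmatize("saw", "VERB"))
print(lemmatize("saw", "NOUN"))
```

Unlike a suffix-stripping stemmer, the output is always a dictionary word, and the same surface form ("saw") can map to different lemmas depending on its part of speech.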
Parts of Speech (PoS) Tags in NLP
properties of words that define their main context in a sentence
common PoS tags include nouns, verbs, adjectives, adverbs, etc.
PoS tags are widely used in tasks such as text cleaning, feature
engineering, and word sense disambiguation
Grammar in NLP
a set of rules for forming well-structured sentences.
Types of grammar:
Constituency Grammar
any word or group of words can be termed a constituent
it organizes a sentence into its constituents using their properties
these properties are derived from part of speech tags, such as noun or verb
Another way to look at constituency grammar is to define the grammar in
terms of the part of speech tags of its constituents.
Dependency Grammar
Dependency Grammar is a type of grammar that organizes words in a sentence
based on their dependencies, with one word acting as the root and all other words
linked to it directly or indirectly. These dependencies represent relationships among words and are
used to infer sentence structure and semantics. Each dependency can be
represented as a triplet containing a governor, a relation, and a dependent.
Dependency grammars are used in various applications, including Named Entity
Recognition, Question Answering Systems, Coreference Resolution, Text
Summarization, and Text Classification.
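The triplet representation can be sketched as a small data structure. The analysis of “I love my phone” below is written by hand, not produced by a parser, and the relation labels (nsubj, obj, nmod:poss) are assumed to follow the Universal Dependencies naming style. The class Dependency and the helper dependents_of are illustrative names.

```python
from typing import NamedTuple

class Dependency(NamedTuple):
    governor: str   # the head word
    relation: str   # the type of link between the two words
    dependent: str  # the word that depends on the governor

# Hand-written analysis of "I love my phone" (not parser output).
dependencies = [
    Dependency("ROOT", "root", "love"),
    Dependency("love", "nsubj", "I"),
    Dependency("love", "obj", "phone"),
    Dependency("phone", "nmod:poss", "my"),
]

def dependents_of(word, deps):
    """Return the words directly governed by the given word."""
    return [d.dependent for d in deps if d.governor == word]

print(dependents_of("love", dependencies))
print(dependents_of("phone", dependencies))
```

Here "love" is the root, "I" and "phone" attach to it directly, and "my" attaches to it indirectly through "phone".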