NLP basics

only 21% of data is stored in structured form; the rest is generated continuously

this generated data is mostly unstructured and in text form

What is NLP?
Natural language processing

helps computers understand and work with human language

applications include: summarization, language translation, search autocomplete, chatbots, voice assistants, and many more

What are Corpus, Tokens, and N-grams?

Corpus
a collection of text documents. Documents are made up of paragraphs, paragraphs are made up of sentences, and sentences are made up of smaller units called tokens (typically words)

N-grams
are defined as groups of n consecutive words. For example, consider this sentence:
“I love my phone.”
In this sentence, the unigrams (n=1) are: I, love, my, phone
The bigrams, also called di-grams (n=2), are: I love, love my, my phone
And the trigrams (n=3) are: I love my, love my phone

So unigrams represent one word, bigrams (di-grams) represent two words together, and trigrams represent three words together.
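
The grouping above can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the function name ngrams is made up for this example, and it assumes words are separated by whitespace and that trailing punctuation can simply be stripped.

```python
def ngrams(sentence, n):
    """Return every run of n consecutive words in the sentence."""
    # Assumption: split on whitespace and strip common trailing punctuation.
    words = [w.strip(".,!?") for w in sentence.split()]
    # Slide a window of width n across the word list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "I love my phone."
print(ngrams(sentence, 1))  # unigrams
print(ngrams(sentence, 2))  # bigrams
print(ngrams(sentence, 3))  # trigrams
```

If n is larger than the number of words, the window never fits and the function returns an empty list.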

Tokenization
the process of splitting a text object into smaller units, which are called tokens

1) White space tokenization


also known as unigram tokenization; the text is split wherever whitespace occurs

For example, the sentence “I went to New-York to play football.” will be split into the following tokens: “I”, “went”, “to”, “New-York”, “to”, “play”, “football.”
Notice that “New-York” is not split further, and the final period stays attached to “football.”, because the tokenization process is based on whitespace only.
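
A minimal sketch of whitespace tokenization, assuming Python's built-in str.split, which splits on any run of whitespace when called without arguments:

```python
text = "I went to New-York to play football."

# str.split() with no arguments splits on runs of whitespace only,
# so hyphens and punctuation stay attached to their words.
tokens = text.split()
print(tokens)
```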

2) Regular Expression Tokenization
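
Here a regular expression pattern, rather than whitespace, decides where tokens begin and end. The sketch below uses Python's re module; the pattern \w+ is only one possible choice, picked here for illustration, and it treats every run of letters, digits, or underscores as a token.

```python
import re

text = "I went to New-York to play football."

# \w+ matches runs of word characters, so the hyphen and the
# final period act as separators and are dropped.
tokens = re.findall(r"\w+", text)
print(tokens)
```

With this pattern “New-York” is split into “New” and “York”, unlike with whitespace tokenization. A different pattern would give different tokens.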

Normalization
Morpheme: the base form of a word

tokens are made up of two main components: the morpheme, which is the base word, and the inflectional form, which is a prefix or suffix attached to the morpheme. For example, “playing” is made up of the morpheme “play” and the suffix “-ing”

Normalization is the process of converting a token into its base form

Types of normalization
Stemming
a rule-based process for removing inflectional forms from tokens; the output is the stem of the word

stemming is often not preferred because it can produce words that are not in the dictionary. For example, a rule that simply strips “-ing” turns “winning” into “winn”
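
The “winn” problem can be reproduced with a deliberately naive suffix-stripping stemmer. This is a sketch only: the suffix list and the name naive_stem are made up for this example, and established stemmers apply more careful rules.

```python
# Hypothetical suffix list, checked in order; chosen only for illustration.
SUFFIXES = ("ing", "ed", "ly", "s")


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


for word in ["winning", "played", "quickly", "cats"]:
    print(word, "->", naive_stem(word))
```

The rule works for “played” and “cats”, but it leaves the doubled consonant in “winning”, giving a stem that is not a dictionary word.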

Lemmatization
a systematic, step-by-step process for removing the inflectional forms of a word

it makes use of vocabulary, word structure, part-of-speech tags, and grammar

the output of lemmatization is the root word, called a lemma
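
To show how vocabulary and part-of-speech tags work together, here is a toy lookup-based lemmatizer. It is a sketch under clear assumptions: the table LEMMA_TABLE and the tag names are hand-written for this example, whereas a real lemmatizer relies on a full dictionary and morphological analysis.

```python
# Hypothetical mini-vocabulary mapping (token, part of speech) to a lemma.
LEMMA_TABLE = {
    ("winning", "VERB"): "win",
    ("winning", "NOUN"): "winning",
    ("better", "ADJ"): "good",
    ("mice", "NOUN"): "mouse",
}


def lemmatize(token, pos):
    """Look up the lemma; fall back to the lowercased token if unknown."""
    key = (token.lower(), pos)
    return LEMMA_TABLE.get(key, token.lower())


print(lemmatize("winning", "VERB"))
print(lemmatize("winning", "NOUN"))
print(lemmatize("Mice", "NOUN"))
```

The same token can map to different lemmas depending on its tag, which is why lemmatization needs part-of-speech information and stemming does not.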

Parts of Speech (PoS) Tags in NLP

PoS tags are properties of words that define their main context in a sentence

common PoS tags are: nouns, verbs, adjectives, adverbs, etc.

PoS tags have many applications and are used in a variety of tasks such as text cleaning, feature engineering, and word sense disambiguation
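
As one illustration of PoS tags in text cleaning, the sketch below keeps only content words from a sentence. The tags are assigned by hand for this example (the tag names follow the Universal POS convention), and the set CONTENT_TAGS is an assumption about which tags count as content words.

```python
# Hand-tagged tokens, for illustration only; a tagger would produce these.
tagged = [
    ("I", "PRON"), ("went", "VERB"), ("to", "ADP"), ("New-York", "PROPN"),
    ("to", "PART"), ("play", "VERB"), ("football", "NOUN"),
]

# Assumed set of tags treated as content-bearing.
CONTENT_TAGS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}


def content_words(tagged_tokens):
    """Keep only the words whose tag marks them as content words."""
    return [word for word, tag in tagged_tokens if tag in CONTENT_TAGS]


print(content_words(tagged))
```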

Grammar in NLP
the set of rules for forming well-structured sentences.

Types of grammar:

Constituency Grammar
any word or group of words can be termed a constituent

it organizes any sentence into its constituents using their properties

these properties are driven by their part-of-speech tags, such as noun or verb

Another way to look at constituency grammar is to define the grammar in terms of part-of-speech tags.
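
A constituency structure can be sketched as a nested tree. The tree below for “I love my phone” is written by hand for illustration; the labels (S, NP, VP) and the helper names words and constituents are assumptions of this example, not the output of a parser.

```python
# Each node is (label, child, child, ...); a leaf is (pos_tag, word).
tree = (
    "S",
    ("NP", ("PRON", "I")),
    ("VP",
        ("VERB", "love"),
        ("NP", ("DET", "my"), ("NOUN", "phone"))),
)


def is_leaf(node):
    """A leaf pairs a part-of-speech tag with a single word string."""
    return len(node) == 2 and isinstance(node[1], str)


def words(node):
    """Return the words covered by a node, left to right."""
    if is_leaf(node):
        return [node[1]]
    result = []
    for child in node[1:]:
        result.extend(words(child))
    return result


def constituents(node):
    """List (label, phrase) for every phrase-level node in the tree."""
    if is_leaf(node):
        return []
    found = [(node[0], " ".join(words(node)))]
    for child in node[1:]:
        found.extend(constituents(child))
    return found


for label, phrase in constituents(tree):
    print(label, ":", phrase)
```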

Dependency Grammar
Dependency Grammar is a type of grammar that organizes words in a sentence
based on their dependencies, with one word acting as a root and all other words
linked to it directly or indirectly. These dependencies represent relationships among words and are
used to infer sentence structure and semantics. Each dependency can be
represented as a triplet containing a governor, a relation, and a dependent.
Dependency grammars are used in various applications, including Named Entity
Recognition, Question Answering Systems, Coreference Resolution, Text
Summarization, and Text Classification.
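
The triplet representation can be sketched as follows for “I love my phone”. The dependencies are written by hand, the relation labels borrow Universal Dependencies naming as an assumption, and the names Dependency and find_root are made up for this example.

```python
from typing import NamedTuple


class Dependency(NamedTuple):
    governor: str
    relation: str
    dependent: str


# Hand-written analysis, for illustration only.
dependencies = [
    Dependency("love", "nsubj", "I"),
    Dependency("love", "obj", "phone"),
    Dependency("phone", "nmod:poss", "my"),
]


def find_root(deps):
    """The root is the one governor that is never a dependent."""
    dependents = {d.dependent for d in deps}
    roots = {d.governor for d in deps if d.governor not in dependents}
    if len(roots) != 1:
        raise ValueError("expected exactly one root word")
    return roots.pop()


print(find_root(dependencies))
```

Words are identified by their text here, which only works when no word repeats in the sentence; a fuller version would identify words by position.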
