NLP Introduction

Introduction:
1.1 What is Natural Language Processing?
1.2 Origins of NLP
1.3 Language and Knowledge
1.4 The Challenges of NLP
1.5 Language and Grammar
1.6 Processing Indian Languages
1.7 NLP Applications
1.8 Language Modelling: Statistical Language Model - N-gram model (unigram, bigram)
1.9 Paninian Framework
1.10 Karaka theory
1.1 What is Natural Language Processing?
Humans communicate through some form of language, either by text or speech.
To make interaction between computers and humans possible, computers need to understand natural
languages used by humans.
Natural language processing is all about making computers learn, understand, analyze,
manipulate and interpret natural (human) languages.
NLP stands for Natural Language Processing, which is a part of Computer Science,
Linguistics, and Artificial Intelligence.
Processing of natural language is required when you want an intelligent system like a robot to
perform as per your instructions, or when you want to hear a decision from a dialogue-based clinical
expert system, etc.
The ability of machines to interpret human language is now at the core of many applications
that we use every day - chatbots, email classification and spam filters, search engines, grammar
checkers, voice assistants, and social language translators.
The input and output of an NLP system can be speech or written text.
Natural language processing (NLP) is the ability of a computer program to understand human
language as it is spoken and written -- referred to as natural language. It is a component of
artificial intelligence (AI).
VI Semester, CSE (DS)    Dr. Murali G, Professor
Natural Language Processing (NLP) is concerned with the development of computational
models of aspects of human language processing. There are two main reasons for such
development:
To develop automated tools for language processing.
To gain a better understanding of human communication.
Building computational models of human language processing requires knowledge of
how humans acquire, store and process language.
Natural Language Processing (NLP) is a fascinating field that sits at the crossroads of linguistics,
computer science, and artificial intelligence (AI). At its core, NLP is concerned with enabling
computers to understand, interpret, and generate human language in a way that is both smart and
useful.
[Figure: NLP at the intersection of linguistics, computer science, and machine learning]
1.2 Origins of NLP
Natural language processing, sometimes termed natural language understanding, originated from machine
translation research. While natural language understanding involves only the interpretation of language,
natural language processing includes both understanding (interpretation) and generation (production).
NLP also includes speech processing; this text, however, is concerned with text processing only, covering work
in the area of computational linguistics and the tasks in which NLP has found useful application.
Computational linguistics deals with the application of linguistic theories and
computational techniques to NLP. In computational linguistics, representing a language is a major problem;
most knowledge representations tackle only a small part of knowledge. Computational models may be
broadly classified under knowledge-driven and data-driven categories. Knowledge-driven systems rely on
explicitly coded linguistic knowledge, often expressed as a set of handcrafted grammar rules.
Data-driven approaches presume the existence of a large amount of data and usually employ some machine
learning technique to learn syntactic patterns.
NLP is no longer confined to classroom teaching and a few traditional applications. With the
unprecedented amount of information now available on the web, NLP has become one of the leading
techniques for processing and retrieving information. The term information retrieval is used here in a broad
manner to include a number of information processing applications such as information extraction, text
summarization, question answering and so forth.
1.3 Language and Knowledge
Language is the medium of expression in which knowledge is deciphered. Language, being a medium
of expression, is the outer form of the content it expresses. The same content can be expressed in
different languages.
Language (text) processing has different levels, each involving different types of knowledge. The
various levels of processing and the types of knowledge they involve are:
1. Lexical analysis
2. Syntactic analysis
3. Semantic analysis
4. Discourse analysis
5. Pragmatic analysis

[Figure: Levels of language processing - Lexical Analysis -> Syntactic Analysis -> Semantic Analysis -> Discourse Analysis -> Pragmatic Analysis]
Lexical analysis
> The simplest level of analysis is lexical analysis, which involves analysis of words. Words are
the most fundamental unit of any natural language text.
> Word-level processing requires morphological knowledge, i.e., knowledge about the structure
and formation of words from basic units (morphemes). The rules for forming words from
morphemes are language specific.
Syntactic analysis
> The next level of analysis is syntactic analysis, which considers a sequence of words as a unit,
usually a sentence, and finds its structure.
> Syntactic analysis decomposes a sentence into its constituents (or words) and identifies how
they relate to each other. It captures the grammaticality or non-grammaticality of sentences by
looking at constraints like word order, number, and case agreement.
ViSemester, CSE (DS) Dr. Murali G, ProfessorIntroduction’
> For example, "She is going to the market" is valid, but "She are going to the market" is not.
Semantic analysis
> Semantics is associated with the meaning of the language. Semantic
analysis is concerned with creating meaningful representations of linguistic inputs. The general
idea of semantic interpretation is to take natural language sentences or utterances and map them
onto some representation of meaning.
> For example, "Colourless green ideas sleep furiously" is well-formed and
syntactically correct, but semantically anomalous. However, this does not mean that syntax has
no role to play in meaning.
> One view considers semantics to be a projection of syntax; that is, semantic structure is
interpreted on syntactic structure.
Discourse analysis
> A higher level of analysis is discourse analysis. Discourse-level processing attempts to interpret
the structure and meaning of even larger units, e.g., at the paragraph and document level, in
terms of words, phrases, clusters and sentences. It requires the resolution of anaphoric references
and identification of discourse structure.
> For example, in the following sentences, resolving the anaphoric reference 'they' requires
pragmatic knowledge:
The district administration refused to give the trade union permission for the meeting because they
feared violence.
The district administration refused to give the trade union permission for the meeting because they
opposed the government.
Pragmatic analysis
> The highest level of processing is pragmatic analysis, which deals with the purposeful use of
sentences in situations. It requires knowledge of the world, i.e., knowledge that extends beyond
the contents of the text.
1.4 The Challenges of NLP
> There are a number of factors that make NLP difficult. These relate to the problems of
representation and interpretation.
> Language computing requires precise representation of content. Given that natural languages
are highly ambiguous and vague, achieving such representation can be difficult.
> The inability to capture all the required knowledge is another source of difficulty.
It is almost impossible to embody all the sources of knowledge that humans use to process
language.
> The greatest source of difficulty in natural language is identifying its semantics.
Compositional semantics considers the meaning of a sentence to be a composition of the
meanings of the words appearing in it.
> Quantifier scoping is another problem. The scope of quantifiers is often not clear and poses
problems in automatic processing.
> The ambiguity of natural languages is another difficulty. The first level of ambiguity arises
at the word level. Without much effort, we can identify words that have multiple meanings
associated with them, e.g., bank, can, bat and still.
1.5 Language and Grammar
> For a machine to process a language, the language must be specified explicitly rather than
relying on the implicit knowledge humans use.
> Grammar defines a language; it consists of a set of rules that allow us to parse and generate
sentences in a language.
> A number of grammars have been proposed, including the transformational grammar proposed
by Chomsky, lexical functional grammar, generalized phrase structure grammar, dependency
grammar, Paninian grammar, and tree-adjoining grammar.
> Generative grammars are often referred to as a general framework; they consist of a set of
rules to specify or generate grammatical sentences in a language.
Example:
Pooja plays Veena <-> Veena is played by Pooja

Figure 1.1: Surface and deep structure of the sentence "Pooja plays Veena"
Transformational grammar was introduced by Chomsky in 1957. Chomsky argued that an utterance is the surface
representation of a 'deeper structure'. The deeper structure can be transformed in a number of ways to yield
many different surface-level representations. Sentences with different surface-level representations can have
the same meaning.
Chomsky's theory was able to explain why sentences like
Pooja plays Veena.
Veena is played by Pooja.
have the same meaning, despite having different surface structures.
Transformational grammar has three components:
1. Phrase structure grammar
2. Transformational rules
3. Morphophonemic rules - these rules match each sentence representation to a string of phonemes.
Phrase structure grammar consists of rules that generate natural language sentences and assign a
structural description to them.
For example, consider the following set of rules (only the Aux rule survived in the source; the
remaining rules are reconstructed from the parse tree of sentence (1) below):
S -> NP + VP
NP -> Det + Noun
VP -> Aux + Verb + NP
Det -> the
Noun -> police, snatcher
Verb -> catch
Aux -> will, is, can
* In these rules, S stands for sentence, NP for noun phrase, VP for verb phrase, and Det for
determiner. Sentences that can be generated using these rules are termed grammatical.
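The generative use of such rules can be sketched in code. The following is my own minimal illustration, not part of the text: a hypothetical recursive recognizer that checks whether a token sequence can be derived from the rules above.

```python
# Toy phrase structure grammar from the rules above (illustrative only).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "Noun"]],
    "VP": [["Aux", "Verb", "NP"]],
    "Det": [["the"]],
    "Noun": [["police"], ["snatcher"]],
    "Verb": [["catch"]],
    "Aux": [["will"], ["is"], ["can"]],
}

def generates(symbols, words):
    """Return True if the symbol sequence can expand to exactly `words`."""
    if not symbols:
        return not words
    head, rest = symbols[0], symbols[1:]
    if head not in GRAMMAR:  # terminal word: must match the next input token
        return bool(words) and words[0] == head and generates(rest, words[1:])
    # non-terminal: try every right-hand side
    return any(generates(list(expansion) + rest, words)
               for expansion in GRAMMAR[head])

print(generates(["S"], "the police will catch the snatcher".split()))  # True
print(generates(["S"], "police the catch will snatcher".split()))      # False
```

A sentence accepted by `generates` is "grammatical" with respect to this toy rule set; anything else is rejected.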
* The second component of transformational grammar is a set of transformational rules, which
transform one phrase-marker (underlying) into another phrase-marker (derived). These
rules are applied on the terminal string generated by the phrase structure rules. They are
used to transform one surface representation into another, e.g., an active sentence into a
passive one.
* Morphophonemic rules match each sentence representation to a string of phonemes.
Consider the active sentence:
The police will catch the snatcher. ...(1)
The application of the phrase structure rules will assign the structure shown in Figure 1.2 to the sentence.
[S [NP [Det the] [Noun police]] [VP [Aux will] [Verb catch] [NP [Det the] [Noun snatcher]]]]
Figure 1.2: Phrase structure of sentence (1)
The passive transformation rules will convert the sentence into:
The + snatcher + will + be + en + catch + by + the + police

Figure 1.3: Structure of sentence (1) after applying the passive transformation.
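The passive transformation described above can be sketched as a string rewrite. This is my own illustration under the simplifying assumption that the active sentence has the fixed shape NP1 + Aux + Verb + NP2, with each NP being Det + Noun:

```python
# Passive transformation sketch: NP1 Aux V NP2 -> NP2 Aux be en V by NP1.
def passive_transform(tokens):
    """tokens: [Det, Noun, Aux, Verb, Det, Noun] for a simple active sentence."""
    np1, aux, verb, np2 = tokens[0:2], tokens[2], tokens[3], tokens[4:6]
    return np2 + [aux, "be", "en", verb, "by"] + np1

active = "the police will catch the snatcher".split()
print(" + ".join(passive_transform(active)))
# the + snatcher + will + be + en + catch + by + the + police
```

The morphophonemic rules would then realize "be + en + catch" as "be caught".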
1.6 Processing Indian Languages
There are a number of differences between Indian languages and English. These introduce differences in
their processing. Some of these differences are listed here:
> Unlike English, Indic scripts have a non-linear structure.
> Unlike English, Indian languages have SOV (Subject-Object-Verb) as the default sentence
structure.
> Indian languages have a free word order, i.e., words can be moved freely within a sentence
without changing the meaning of the sentence.
> Spelling standardization is more subtle in Hindi than in English.
> Indian languages have a relatively rich set of morphological variants.
> Indian languages use post-position case markers instead of prepositions.
> Indian languages use verb complexes consisting of sequences of verbs.
1.7 NLP Applications
Machine Translation - refers to automatic translation of text from one human language to another. It
is necessary to have an understanding of the words and phrases, the grammars of the two languages involved,
the semantics of the languages, and world knowledge.
Speech Recognition - This is the process of mapping acoustic speech signals to a set of words.
Speech Synthesis - refers to automatic production of speech. Such systems can read out your mails on
the telephone, or even read out a storybook for you. NLP remains an important component of any
speech synthesis system.
Natural Language Interfaces to Databases - Natural language interfaces allow querying a structured
database using natural language sentences.
Information Retrieval - This is concerned with identifying documents relevant to a user's query. NLP
techniques have found useful applications in information retrieval.
Information Extraction - An information extraction system captures and outputs factual information
contained within a document. Like an information retrieval system, it responds to a user's information
need.
Question Answering - A question answering system attempts to find the precise answer, or at least the
precise portion of text in which the answer appears. A question answering system differs from an
information extraction system in that the content to be extracted is unknown in advance.
Text Summarization - This deals with the creation of summaries of documents and involves syntactic,
semantic, and discourse-level processing of text.
1.8 Language Modelling
A model is a description of some complex entity or process. A language model is thus a description of
language.
Natural language is a complex entity; in order to process it, we need to represent it, i.e., build a model of it.
This is known as language modelling.
Language modelling can be viewed as a problem of grammar inference or as a problem of probability
estimation.
A grammar-based language model attempts to distinguish a grammatical sentence from a non-grammatical
one, whereas a probability model estimates the likelihood of word sequences.
There are two approaches to language modelling:
One is to define a grammar that can handle the language.
The other is to capture the patterns of the language statistically.
1.8.1 Statistical Language Model
> A statistical language model is a probability distribution P(s) over all possible word sequences (or
any other linguistic unit like words, sentences, paragraphs, documents or spoken utterances). A
number of statistical language models have been proposed in the literature. The dominant approach in
statistical language modelling is the n-gram model.
> Statistical language models are fundamental to many NLP applications,
like speech recognition, spelling correction, machine translation, question answering, information
retrieval and text summarization.
n-gram Model
The goal of a statistical language model is to estimate the probability of a sentence. This is achieved by
decomposing the sentence probability into a product of conditional probabilities using the chain rule as
follows:
P(s) = P(w1, w2, ..., wn)
     = P(w1) P(w2|w1) P(w3|w1 w2) P(w4|w1 w2 w3) ... P(wn|w1 w2 ... wn-1)
     = Π P(wi | hi),  i = 1 ... n
where hi is the history of word wi, defined as w1, w2, ..., wi-1.
So, in order to calculate the sentence probability, we need to calculate the probability of a word given the
sequence of words preceding it. An n-gram model simplifies the task by approximating the probability of a
word given the previous n-1 words only:
P(wi | hi) ≈ P(wi | wi-n+1 ... wi-1)
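The history truncation can be shown in a few lines of code. This is my own sketch (the function name and example sentence are illustrative, not from the text):

```python
# An n-gram model replaces the full history h_i = w1 ... w(i-1)
# with just the previous n-1 words.
def truncated_history(words, i, n):
    """History used by an n-gram model for the word at index i (0-based)."""
    return words[max(0, i - (n - 1)):i]

words = "the arabian knights are the fairy tales of the east".split()
print(truncated_history(words, 4, 2))  # bigram:  ['are']
print(truncated_history(words, 4, 3))  # trigram: ['knights', 'are']
```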
Thus an n-gram model calculates P(wi | hi) by modelling language as a Markov model of order n-1,
i.e., by looking at the previous n-1 words only.
Using bigram and trigram estimates, the probability of a sentence can be calculated as:
P(s) = Π P(wi | wi-1)          (bigram)
P(s) = Π P(wi | wi-2 wi-1)     (trigram)
For example, the bigram approximation conditions a word on the single preceding word,
e.g., P(cat | the), while a trigram approximation conditions on the two preceding words,
e.g., P(cat | of the).
A special word (pseudo-word) <s> is introduced to mark the beginning of the sentence in bigram
estimation; the probability of the first word in a sentence is conditioned on <s>. Similarly, in
trigram estimation, two pseudo-words <s1> and <s2> are introduced.
Now, we discuss how to estimate these probabilities. The n-gram model parameters are estimated
from a training corpus using maximum likelihood estimation (MLE), i.e., by taking the relative
frequency of the n-gram in the training corpus:
P(wi | wi-n+1 ... wi-1) = C(wi-n+1 ... wi) / C(wi-n+1 ... wi-1)
where C(.) denotes the count of the word sequence in the training corpus. The sum of the counts of
all n-grams sharing the same first n-1 words equals the count of their common prefix, so the
estimates are properly normalized.

Example: Training set
The Arabian Knights
These are the fairy tales of the east
The stories of the Arabian Knights are translated in many languages

Bigram model (maximum likelihood estimates from the training set):
P(the|<s>) = 0.67     P(arabian|the) = 0.4      P(knights|arabian) = 1.0
P(are|these) = 1.0    P(the|are) = 0.5          P(fairy|the) = 0.2
P(tales|fairy) = 1.0  P(of|tales) = 1.0         P(the|of) = 1.0
P(east|the) = 0.2     P(stories|the) = 0.2      P(of|stories) = 1.0
P(are|knights) = 0.5  P(translated|are) = 0.5   P(in|translated) = 1.0
P(many|in) = 1.0      P(languages|many) = 1.0

Test sentence (s): The Arabian Knights are the fairy tales of the east

P(s) = P(the|<s>) × P(arabian|the) × P(knights|arabian) × P(are|knights)
       × P(the|are) × P(fairy|the) × P(tales|fairy) × P(of|tales)
       × P(the|of) × P(east|the)
     = 0.67 × 0.4 × 1.0 × 0.5 × 0.5 × 0.2 × 1.0 × 1.0 × 1.0 × 0.2
     = 0.0027
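The worked example above can be reproduced programmatically. The following is a sketch of my own (assuming lowercased tokens and a `<s>` start marker), not code from the text:

```python
# Bigram MLE model for the three-sentence training set above.
from collections import Counter

corpus = [
    "the arabian knights",
    "these are the fairy tales of the east",
    "the stories of the arabian knights are translated in many languages",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p(word, prev):
    """MLE bigram probability P(word | prev) = C(prev word) / C(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

test = "<s> the arabian knights are the fairy tales of the east".split()
prob = 1.0
for prev, word in zip(test, test[1:]):
    prob *= p(word, prev)

print(round(p("arabian", "the"), 2))  # 0.4
print(round(prob, 4))                 # 0.0027
```

Note that P(the|<s>) is computed as 2/3 here; the 0.67 in the table is that same value rounded.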
As each probability is necessarily less than one, multiplying the probabilities together might
cause numerical underflow, particularly for long sentences. To avoid this, calculations are done
in log space, where a product of probabilities corresponds to a sum of their log probabilities.
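The log-space trick can be shown directly with the bigram probabilities from the worked example above (my own sketch):

```python
# Summing log probabilities avoids the underflow that repeated
# multiplication of small numbers can cause for long sentences.
import math

probs = [0.67, 0.4, 1.0, 0.5, 0.5, 0.2, 1.0, 1.0, 1.0, 0.2]

log_prob = sum(math.log(p) for p in probs)  # safe even for very long sentences
prob = math.exp(log_prob)                   # back to a probability if needed

print(round(prob, 4))  # 0.0027
```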
> The n-gram model suffers from the data sparseness problem. An n-gram that does not occur in
the training data is assigned zero probability, so that even with a large training corpus many
entries remain zero.
> A number of smoothing techniques have been developed to handle the data sparseness problem,
the simplest being add-one smoothing.
> Smoothing in general refers to the task of re-evaluating zero-probability or low-probability
n-grams and assigning them non-zero values.
Add-one Smoothing
This is the simplest smoothing technique. It adds a value of one to each n-gram frequency before
normalizing the frequencies into probabilities. The conditional probability thus becomes:
P(wi | wi-n+1 ... wi-1) = (C(wi-n+1 ... wi) + 1) / (N + V)
where N is the total count of n-grams sharing the history wi-n+1 ... wi-1 and V is the vocabulary
size. Equivalently, an n-gram occurring C times is given the adjusted count
C* = (C + 1) N / (N + V).
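A minimal sketch of add-one smoothing for bigrams, assuming a toy two-sentence corpus of my own choosing (not from the text):

```python
# Add-one (Laplace) smoothing: every bigram count is incremented by one and
# the denominator grows by the vocabulary size V, so unseen bigrams get a
# small non-zero probability instead of zero.
from collections import Counter

corpus = ["the arabian knights", "these are the fairy tales of the east"]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

V = len(set(unigrams))  # vocabulary size

def p_add_one(word, prev):
    """Smoothed P(word | prev) = (C(prev word) + 1) / (C(prev) + V)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(p_add_one("knights", "arabian"))  # seen bigram: (1+1)/(1+9) = 0.2
print(p_add_one("east", "arabian") > 0) # unseen bigram: non-zero now
```

The cost of add-one smoothing is that probability mass is shifted from seen n-grams to the (much larger) set of unseen ones, which is why more refined techniques are usually preferred in practice.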
1.9 Paninian Framework (PG)
> Paninian grammar (PG) was written by Panini around 500 BC. The framework can be used for
other Indian languages and possibly for some Asian languages as well.
> Unlike English, Indian languages are SOV (Subject-Object-Verb) ordered and inflectionally
rich. These inflections provide important syntactic and semantic cues for language analysis
and understanding.
Some important features of Indian languages:
> Indian languages have traditionally used oral communication for knowledge propagation. Texts
in these languages were meant to be passed from the speaker's mind to the listener's mind.
> Indian languages have a relatively free word order: the word groups representing subject,
object, and verb can occur in any order. Hence, position alone cannot be used to identify the
subject and object of a sentence.
> For example, the Hindi sentence
maa bachche ko khaanaa detii hai
(mother child-to food gives)
'Mother gives food to the child.'
retains the same meaning when its word groups are reordered.
> The auxiliary verbs follow the main verb and occur as separate words, whereas in Western
languages tense and aspect combine with the main verb. For example:
khaa rahaa hai  - is eating
khaa rahaa thaa - was eating
> In Indian languages, the noun is followed by post-positions instead of prepositions.
Layered Representation in PG
> Many modern syntactic theories, such as Government and Binding (GB), represent a sentence at
purely syntactic levels, e.g., deep structure, surface structure, and logical form (LF).
> The Paninian grammar framework is said to be syntactico-semantic: one can go from the surface
layer to deep semantics by passing through intermediate layers. The language is represented in
the following layers: surface level, vibhakti level, karaka level, and semantic level, where the
surface level represents the actual utterance.
> The semantic level is concerned with the meaning of units of language: words, phrases or
chunks, and sentences.
> Vibhakti literally means inflection, but here it refers to word (noun, verb or other) groups
formed on the basis of case endings, post-positions, compound verbs, or main and auxiliary
verbs, etc. All Indian languages can be represented at the vibhakti level.
> Karaka (pronounced kaaraka) literally means case. The Paninian grammar has its own way of
defining karakas. Karaka relations are based on the way word groups participate in the activity
denoted by the verb; they are thus syntactico-semantic relations.
1.10 Karaka Theory
Karaka theory is the central theme of PG. Karaka relations capture the roles played by the various
participants in the main activity. These roles are reflected in the case markers (language
specific) and post-position markers (parsarg).
The various karakas are: karta (subject), karma (object), karana (instrument), sampradaana
(beneficiary), apaadaan (separation) and adhikaran (locus). These definitions are only examples
and not a complete discussion of PG or karaka theory.
To explain the various karaka relations, let us consider an example.
maa bachchii ko aangan mein haath se rotii khilaatii hai
(mother child-to courtyard-in hand-with bread feeds)
'The mother feeds bread to the child by hand in the courtyard.'

The most important karaka is karta, which is loosely related to the subject. Karta is defined as
the most independent participant in the action; here, maa (mother) is the karta.
Karma is akin to the object: it is the locus of the result of the activity. Here, rotii (bread)
is the karma.
Karana (instrument) is the means by which the action is accomplished; here, haath (hand) is the
karana.
Sampradaan is the beneficiary of the action; here, bachchii (the child) is the sampradaan.
Apaadaan denotes separation, and adhikaran denotes the locus of the action; here, aangan
(courtyard) is the adhikaran.
Issues in Paninian Grammar
The two major issues concerning PG are:
1) Computational implementation of PG.
2) Adaptation of PG to Indian and other similar languages.
The analysis proceeds in multiple layers, in which each layer consists of rules that map its
representation to the next higher level.