Unit V Intelligence and Applications: Morphological Analysis/Lexical Analysis
Semantic Analysis
Discourse Analysis
Pragmatic Analysis
Overview of NLP
NLP considers the hierarchical structure of a language: text is a set of symbols, several words make a phrase, several phrases make a sentence that conveys an idea, and several sentences make a paragraph that conveys some context.
There are two main components of NLP:
(i) Natural Language Understanding
It involves mapping a given input in natural language to a useful representation.
(ii) Natural Language Generation
It involves producing meaningful phrases and sentences to convey some internal representation.
Phases of NLP
1. Morphological Analysis/Lexical Analysis
It involves identifying and analyzing the structure of words. During lexical analysis the whole text is divided into paragraphs, sentences and words.
2. Syntactic Analysis (Parsing)
It involves analysis of words in the sentence for grammar and arranging words in a manner that shows
the relationship among them.
Example: The school goes to boy {Rejected by syntactic analysis}
3. Semantic Analysis
It involves identifying the exact dictionary meaning of the text; that is, it checks that the sentences formed are meaningful.
Example: "hot ice cream" (rejected by semantic analysis)
4. Discourse Analysis
The meaning of any sentence may depend on the meaning of the sentence that precedes it and may also influence the meaning of the sentence that follows it.
Example: "She wanted it." (Not clear unless we are aware of the previous sentence)
5. Pragmatic Analysis
The sentence representing what was said is reinterpreted to determine what was actually meant. It involves deriving those aspects of language which require real-world knowledge.
Example: "Do you know who I am?" (A machine may not understand the intent behind the sentence)
MORPHOLOGICAL ANALYSIS
Morphology is the study of the structure and formation of words. Its most important unit is the morpheme,
which is defined as the "minimal unit of meaning". Consider a word like "unhappiness": it has three parts, un + happy + ness (prefix, stem and suffix).
The goal of a morphological parser is to discover the morphemes behind a given word.
Example: cats → cat + s
cat is the root or stem word
s is the plural affix
Morphological analysis and parsing are essential for many NLP applications such as spelling error detection, machine translation and information retrieval. There are two broad classes of morphemes:
Stem: the main morpheme that contains the real meaning
Affixes: modify the meaning given by the stem
1. Inflection
Inflection is the process of changing the form of a word so that it expresses information such as number,
person, case, gender, tense, mood and aspect, but the syntactic category of the word remains unchanged. As
an example, the plural form of the noun in English is usually formed from the singular form by adding an s.
car / cars
table / tables
dog / dogs
In each of these cases, the syntactic category of the word remains unchanged.
2. Derivation
Inflection doesn't change the syntactic category of a word; derivation does. It combines a word stem with a grammatical morpheme, resulting in a word belonging to a different class.
Example: teach (verb) → teacher (noun)
3. Compounding
Process of merging two or more words to form a new word.
Example: overlook, desktop, download
A morphological parser uses the following information sources
i. Lexicon: A lexicon lists stems and affixes together with basic information about them.
ii. Morphotactics: It specifies the allowable ordering of morphemes that constitute a word.
iii. Orthographic rules: These specify the spelling changes that occur when two given morphemes are combined.
Example: easy + er → easier (y → i)
iv. Stemmer: Example: rotational → rotate
rule: ational → ate
Morphological analysis can be avoided if an existing lexicon is available that lists features for all word forms of all root words in a language. However, it isn't practical to list all possible combinations of word forms in many languages. The simplest morphological systems are stemmers, which convert morphological variations of a given word to its stem. Stemmers do not require a lexicon; instead they use a set of rewrite rules (for example, ies → y and ing → e) to derive the root word.
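As a rough sketch, a stemmer of this kind can be implemented as a list of suffix rewrite rules applied to each word. The rule set below is a small illustrative example, not the rule set from these notes.

# A tiny rule-based stemmer sketch: each rule rewrites a suffix.
RULES = [
    ("ational", "ate"),   # rotational -> rotate
    ("ies", "y"),         # ponies -> pony
    ("ing", "e"),         # making -> make
    ("s", ""),            # cats -> cat
]

def stem(word):
    # apply the first rule whose suffix matches the word
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

for w in ["rotational", "ponies", "making", "cats"]:
    print(w, "->", stem(w))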
SYNTAX ANALYSIS
Syntax analysis checks the text against the rules of a formal grammar. The objective of syntactic analysis is to find the syntactic structure of the sentence, which determines the meaning of the sentence. The structure can be represented as a tree where internal nodes represent phrases and leaf nodes are words. By contrast, a phrase like "hot ice-cream" is syntactically well formed and would only be rejected by the semantic analyzer.
NP - Noun Phrase
An NP is a phrase whose head is a noun or pronoun, optionally accompanied by a set of modifiers.
NP → Noun
NP → Pronoun
NP → Det Noun
NP → Adj Noun
NP → Det Adj Noun
VP - Verb Phrase
A VP can simply be a verb, or a verb attached to a noun phrase, prepositional phrase, etc.
VP → Verb
VP → Verb NP
VP → Verb PP
Sentence
S → VP
S → NP VP
Wh-question structure
S → Wh-NP VP
S → Wh-NP Aux NP VP
Yes or No question
S → Aux NP VP
Types of Parsing
Derivation divides parsing into the following two types:
Top-down Parsing
Bottom-up Parsing
TOP-DOWN PARSING
In this kind of parsing, the parser starts constructing the parse tree from the start symbol and then tries to
transform the start symbol into the input. The most common form of top-down parsing uses a recursive procedure to process the input. Top-down parsing starts its search from the root node and works downwards towards the leaf nodes. The root is expanded using the grammar rules, with 'S' as the start symbol. Each non-terminal symbol in the resulting sub-tree is then expanded using the appropriate grammar rules.
Example 1: Paint the door
The parser expands S level by level, trying S → NP VP and S → VP, and within each alternative trying the possible expansions of NP (Det Noun, Pronoun, Det Noun PP, ...) and VP (Verb, Verb NP), until the input words are matched. The successful derivation is:
S → VP
VP → Verb NP (paint)
NP → Det Noun (the, door)
Example 2: Deepika reads the book
S → NP VP
NP → Noun (Deepika)
VP → Verb NP (reads)
NP → Det Noun (the, book)
Example 3 (Yes or No question): for a question beginning with "Does", the parser expands S → Aux NP VP, with Aux = Does and the object NP → Det Noun (a, meal).
Advantages
It never wastes time exploring trees that cannot result in an S, since it only builds trees rooted at the start symbol of the grammar.
Disadvantages
It may generate partial trees that are inconsistent with the input, because it starts from the start symbol rather than from the words. It is also time consuming, because it checks each and every rule of the grammar.
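As a rough illustration of top-down parsing, the sketch below implements a naive recursive-descent recognizer for a tiny version of the grammar above. The grammar, lexicon, and function names are illustrative assumptions, not taken from the notes.

# Naive top-down (recursive-descent) recognizer for a toy CFG.
GRAMMAR = {
    "S":  [["NP", "VP"], ["VP"]],
    "NP": [["Det", "Noun"], ["Noun"]],
    "VP": [["Verb", "NP"], ["Verb"]],
}

LEXICON = {
    "Det":  {"the", "a"},
    "Noun": {"deepika", "book", "door"},
    "Verb": {"reads", "paint"},
}

def parse(symbol, words, pos):
    """Try to expand `symbol` starting at index `pos`.
    Returns the set of positions reachable after consuming `symbol`."""
    results = set()
    if symbol in LEXICON:                      # terminal category
        if pos < len(words) and words[pos] in LEXICON[symbol]:
            results.add(pos + 1)
        return results
    for production in GRAMMAR[symbol]:         # try each rule top-down
        positions = {pos}
        for child in production:
            positions = {q for p in positions for q in parse(child, words, p)}
        results |= positions
    return results

def recognize(sentence):
    words = sentence.lower().split()
    return len(words) in parse("S", words, 0)

print(recognize("Deepika reads the book"))   # True
print(recognize("The school goes to boy"))   # False (words not in the toy lexicon)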
BOTTOM-UP PARSING
In this kind of parsing, the parser starts with the input symbol and tries to construct the parse tree in an upward
direction towards the root. At each step the parser checks the rules in the grammar where the RHS matches
the portion of the parse tree constructed so far. It then reduces it using the LHS of the production. The parse
tree is considered to be successful if the parser reduces the tree to the start symbol of the grammar.
Example 1: Paint the door
"the door" is reduced to NP, "paint NP" is reduced to VP, and VP is reduced to S.
Example 2: Deepika reads the book
"Deepika" is reduced to NP, "the book" is reduced to NP, "reads NP" is reduced to VP, and NP VP is reduced to S.
Advantages
It never wastes time exploring trees that do not match the input.
Disadvantages
It wastes time generating trees that have no chance of leading to an S-rooted tree.
Disadvantages of parsing
Left recursion (leads to infinite loop)
Ambiguity can’t be resolved
Example: "The boy saw the man with a telescope" — either the boy used the telescope to see the man, or the boy saw a man who had a telescope.
Repeated parsing
Solution
A dynamic programming algorithm can solve these problems by constructing a table containing solutions to sub-problems:
Earley Parser (Top down)
CYK Parser (Bottom Up)
CKY ALGORITHM ( Cocke–Younger–Kasami algorithm )
It is one of the earliest recognition and parsing algorithms. The standard version of CKY can only
recognize languages defined by context-free grammars in Chomsky Normal Form (CNF). It is also possible to
extend the CKY algorithm to handle some grammars which are not in CNF.
Observation: any portion of the input string spanning i to j can be split at k, and structure can then be built
using sub-solutions spanning i to k and sub-solutions spanning k to j .
(Figure: a CKY chart — the bottom row holds the lexical categories of the words; higher cells hold constituents such as NP, VP, PP and S built from pairs of smaller spans, e.g. NP → Det N, VP → V NP, S → NP VP.)
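The sketch below is a minimal CKY recognizer for a toy grammar in CNF; it fills the chart bottom-up, combining constituents over spans i..k and k..j exactly as described in the observation above. The grammar, lexicon, and function names are illustrative assumptions.

# Minimal CKY recognizer sketch for a grammar in Chomsky Normal Form.
from itertools import product

# Binary rules A -> B C and lexical rules word -> A
BINARY = {("NP", "VP"): "S", ("Det", "N"): "NP", ("V", "NP"): "VP", ("P", "NP"): "PP"}
LEXICAL = {"deepika": "NP", "reads": "V", "the": "Det", "book": "N"}

def cky_recognize(sentence):
    words = sentence.lower().split()
    n = len(words)
    # table[i][j] = set of non-terminals spanning words[i:j]
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        table[i][i + 1].add(LEXICAL.get(w))
    for span in range(2, n + 1):                 # width of the span
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):            # split point
                for b, c in product(table[i][k], table[k][j]):
                    if (b, c) in BINARY:
                        table[i][j].add(BINARY[(b, c)])
    return "S" in table[0][n]

print(cky_recognize("Deepika reads the book"))   # True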
SEMANTIC ANALYSIS
It involves mapping of natural language occurrences to some representation of meaning. A meaning
representation language bridges the gap between linguistic and common sense knowledge.
Example:
A cup on the table → on(cup, table)
Air India serves Hyderabad → serves(Air India, Hyderabad)
Examples of meaning representation languages are:
First Order Predicate Calculus
Semantic Networks
Conceptual Dependencies
Characteristics of Meaning representation language
1. Verifiable: We should be able to identify the meaning and also verify the meaning.
2. Unambiguous: There should not be any ambiguity. There should be only one representation
of the sentence.
3. Support canonical form: sentences with the same meaning should map to the same representation.
Does Air India serve Hyderabad? → serves(x, y)
Does Air India offer a flight to Hyderabad? → offers(x, y)
Does Air India have a flight to Hyderabad? → has(x, y)
4. Support inference and variables
5. Expressiveness: A wide range of domain must be served.
Meaning representations can be created and assigned to linguistic units using two approaches:
(i) Syntax-driven semantic analysis
(ii) Semantic grammar
(i) Syntax driven semantic analysis
It uses syntactic constituents of a sentence to build its meaning representation.
Example: President nominates speaker
S → NP VP
NP → Noun (President)
VP → Verb NP (nominates)
NP → Noun (speaker)
It is based on the "principle of compositionality", which states that the meaning of the whole sentence can be derived from the meaning of its parts. In reality the actual meaning of a sentence is derived from constituent meanings, relationships between words, context, domain, word order and real-world knowledge.
The old man finally kicked the bucket- The man finally died (idiom)
All roads lead to Rome- There are many ways to do a task. (idiom)
Disadvantages
It depends upon traditional grammar rules which aren’t meant for semantic processing. It only provides the
abstract information which many times does not fit the semantic meaning.
(ii) Semantic Grammar
Example: I want to go from Delhi to Chennai on 24th March.
InfoRequest → user wants to go to City from City TimeExpr
Drawback
o It is domain dependent; whenever we add a new domain we have to write new rules.
o It focuses on building semantic capabilities into the grammar.
o Here, rules and constituents correspond directly to entities and activities in the domain.
o Rules are domain specific, therefore it is required to develop new rules for each domain.
Example: Indian Hotel and Chinese Food — we need to write different rules for this phrase because the two parts belong to different contexts.
Applications of semantic analysis
Information retrieval
Machine translation
Question answering system
Sentimental analysis
Text classification
Text analysis
Importance of semantic analysis
It is concerned with understanding the actual meaning by extracting contextual information.
Example: the query "feeling hungry" should be understood as a request for nearby restaurants.
WordNet: It is a large lexical database for the English language that lists all senses of a word.
Relationships are of 4 types, namely:
Homonymy: words with the same form but different, unrelated meanings
Polysemy: many possible related meanings for a word
Antonymy: words with opposite meanings
Synonymy: many words for a single meaning
LANGUAGE MODELS
A Language Model is a probabilistic model which predicts the probability that a sequence of tokens belongs
to a language.
The probabilities returned by a language model are mostly useful to compare the likelihood that different
sentences are "good sentences".
Applications include spell checking, automatic speech recognition and machine translation.
N-grams
o N-grams are the simplest tool available to construct a language model.
o An N-gram is a sequence of N words.
o An N-gram model predicts the probability of a given N-gram within any sequence of words
in the language.
P(s) = P(w1, w2, w3, ..., wn)
The chain rule of probability is an exact rewriting of a joint probability:
P(w1 ... wn) = P(w1) · P(w2 | w1) · P(w3 | w1 w2) · P(w4 | w1 w2 w3) · ... · P(wn | w1 ... wn−1)
P(w1 w2 ... wn) = ∏i P(wi | w1 w2 ... wi−1)
Example corpus: 1. The Arabian Knights. 2. These are the fairy tales of the east. 3. The stories of the Arabian Knights are translated in many languages.
In a unigram model, the probability of each word depends only on its own frequency in the corpus:
P(wn) = (number of times wn appears) / (total number of words in the corpus)
If the total number of words in the corpus is 10000 and 'the' appears 7000 times, then
P(the) = 7000/10000 = 0.7 = 70%
A zero probability for a single n-gram causes a zero probability for the entire sentence. This can be avoided by smoothing the n-gram model.
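A minimal sketch of the unigram estimate above, together with an add-one-smoothed bigram estimate (the toy corpus and the choice of add-one smoothing are illustrative assumptions):

# Unigram and bigram maximum-likelihood estimates from a tiny corpus.
from collections import Counter

corpus = "the arabian knights are the fairy tales of the east".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def p_unigram(w):
    # P(w) = count(w) / total number of words in the corpus
    return unigrams[w] / total

def p_bigram(w_prev, w):
    # P(w | w_prev) with add-one smoothing, to avoid the
    # zero-probability problem mentioned above
    vocab = len(unigrams)
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab)

print(p_unigram("the"))            # 3/10 = 0.3
print(p_bigram("the", "arabian"))  # smoothed estimate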
INFORMATION RETRIEVAL
Information retrieval is the task of finding documents that are relevant to a user's need for information. The best-known examples of information retrieval systems are search engines on the World Wide Web.
(Figure: the user submits a query; after query processing, the IR system matches it against the document repository and returns a list of documents related to the query.)
A user requiring information formulates a request in the form of a query written in natural language. The retrieval system responds by retrieving the documents that seem to be relevant. Information retrieval deals with the organization, storage, retrieval and evaluation of information related to the user's query.
An IR model involves the following activities.
Document Indexing
Query Processing
Document matching and Ranking
1. Document Indexing
In IR system the actual text of the document isn’t used in the retrieval process. Documents are represented
through a set of index terms or keywords. These index terms are used for accessing the required documents.
Document indexing can be done automatically, using the content of the document, or manually, where the set of keywords is defined by the creator of the document.
2. Query processing
It involves finding the keywords or index terms in the user query which are used for the retrieval process. The user-entered queries are processed to remove the words which do not contribute to the retrieval process.
Stop word removal: Stop words are high-frequency words which have little semantic weight and are thus not helpful in the retrieval process. These words play important grammatical roles in language, such as in the formation of phrases, but do not contribute to the semantic content of a document in a keyword-based representation. Eliminating stop words considerably reduces the number of index terms.
Example: an, a, the, form, also, for
Stemming: Stemming focuses on removing affixes.
Example: programs, programming, programmer → program
Lemmatization: aims to find the root word (lemma) as present in the dictionary.
3. Document Matching and Ranking
It involves finding the relevant documents from the document repository that may contain the answer to the user's query, and ranking them as per their relevance to the user's query.
IR scoring functions
A scoring function takes a document and a query and returns a numeric score; the most relevant documents
have the highest scores. In the BM25 function, the score is a linear weighted combination of scores for each
of the words that make up the query.
Three factors affect the weight of a query term:
1. First, the frequency with which a query term appears in a document. For the query [farming in Kansas],
documents that mention “farming” frequently will have higher scores.
2. Second, the inverse document frequency of the term, or IDF. The word “in” appears in almost every
document, so it has a high document frequency, and thus a low inverse document frequency, and thus
it is not as important to the query as “farming” or “Kansas.”
3. Third, the length of the document. A million-word document will probably mention all the query words,
but may not actually be about the query. A short document that mentions all the words is a much better
candidate.
The BM25 function takes all three of these into account. Let TF(qi, dj) be the count of the number of times word qi appears in document dj, and DF(qi) the number of documents that contain the word qi. The score of document dj for a query q1 ... qn is
BM25(dj, q1 ... qn) = Σi IDF(qi) · TF(qi, dj) · (k + 1) / ( TF(qi, dj) + k · (1 − b + b · |dj| / L) )
where |dj| is the length of document dj in words, L is the average document length in the corpus, L = Σi |di| / N, and IDF(qi) = log( (N − DF(qi) + 0.5) / (DF(qi) + 0.5) ) is the inverse document frequency of qi. The two parameters k and b can be tuned by cross-validation; typical values are k = 2.0 and b = 0.75.
It would be impractical to apply the BM25 scoring function to every document in the corpus. Instead, systems create an index ahead of time that lists, for each vocabulary word, the documents that contain it. This is called the hit list for the word. Then, given a query, we intersect the hit lists of the query words and only score the documents in the intersection.
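A minimal BM25 sketch following the formula above, using the typical parameter values k = 2.0 and b = 0.75; the three-document corpus, tokenization and the example query are illustrative assumptions.

# Sketch of the BM25 scoring function described above.
import math

docs = {
    "d1": "farming in kansas is large scale farming".split(),
    "d2": "weather report for the region in winter".split(),
    "d3": "used tractor equipment for sale".split(),
}

N = len(docs)
L = sum(len(d) for d in docs.values()) / N          # average document length
k, b = 2.0, 0.75                                     # typical tuning parameters

def df(word):
    return sum(1 for d in docs.values() if word in d)

def idf(word):
    return math.log((N - df(word) + 0.5) / (df(word) + 0.5))

def bm25(doc_id, query):
    d = docs[doc_id]
    score = 0.0
    for q in query:
        tf = d.count(q)
        score += idf(q) * tf * (k + 1) / (tf + k * (1 - b + b * len(d) / L))
    return score

query = "farming in kansas".split()
for doc_id in docs:
    print(doc_id, round(bm25(doc_id, query), 3))    # d1 gets the highest score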
Evaluation of IR system
Imagine that an IR system has returned a result set for a single query, for which we know which documents
are and are not relevant, out of a corpus of 100 documents.
                 In result set   Not in result set
Relevant              30                20
Not relevant          10                40
Evaluation of IR system can be done on the following points
Time between submission of query and getting the response.
Presentation format
User effort to get the information
Precision measures the proportion of documents in the result set that are actually relevant.
Precision = (number of relevant documents retrieved) / (total number of documents retrieved) = 30 / (30 + 10) = 0.75
Recall measures the proportion of all the relevant documents in the collection that are in the result set.
Recall = (number of relevant documents retrieved) / (total number of relevant documents in the collection) = 30 / (30 + 20) = 0.60
In a very large document collection, such as the World Wide Web, recall is difficult to compute, because there
is no easy way to examine every page on the Web for relevance.
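For completeness, the same precision and recall values computed directly from the contingency table above:

# Computing precision and recall from the table above.
relevant_retrieved = 30      # relevant documents in the result set
irrelevant_retrieved = 10    # non-relevant documents in the result set
relevant_missed = 20         # relevant documents not in the result set

precision = relevant_retrieved / (relevant_retrieved + irrelevant_retrieved)
recall = relevant_retrieved / (relevant_retrieved + relevant_missed)

print(precision, recall)     # 0.75 0.6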
The PageRank of a page p is given by
PR(p) = (1 − d) / N + d · Σi PR(ini) / C(ini)
where PR(p) is the PageRank of page p, N is the total number of pages in the corpus, ini are the pages that link in to p, and C(ini) is the count of the total number of out-links on page ini. The constant d is a damping factor.
The HITS algorithm for computing hubs and authorities with respect to a query. RELEVANT-PAGES fetches
the pages that match the query, and EXPAND-PAGES adds in every page that links to or is linked from one
of the relevant pages. NORMALIZE divides each page’s score by the sum of the squares of all pages’ scores.
Given a query, HITS first finds a set of pages that are relevant to the query. It does that by intersecting
hit lists of query words, and then adding pages in the link neighborhood of these pages—pages that link to or
are linked from one of the pages in the original relevant set. Each page in this set is considered an authority on
the query to the degree that other pages in the relevant set point to it. A page is considered a hub to the degree
that it points to other authoritative pages in the relevant set. As with PageRank, we don’t want to merely
count the number of links; we want to give more value to the high-quality hubs and authorities. Thus, as with
PageRank, we iterate a process that updates the authority score of a page to be the sum of the hub scores of the
pages that point to it, and the hub score to be the sum of the authority scores of the pages it points to. If we
then normalize the scores and repeat k times, the process will converge. Both PageRank and HITS played
important roles in developing our understanding of Web information retrieval.
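A small sketch of the PageRank iteration described above, run on a hypothetical four-page link graph; the graph, damping factor and iteration count are illustrative assumptions.

# Minimal PageRank iteration sketch for a tiny hand-made link graph.
# links[p] = list of pages that p links out to
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

N = len(links)
d = 0.85                                 # damping factor
pr = {p: 1.0 / N for p in links}         # initial PageRank values

for _ in range(50):                      # iterate until approximately converged
    new_pr = {}
    for p in links:
        # pages that link in to p, each contributing PR(in) / C(in)
        incoming = sum(pr[q] / len(links[q]) for q in links if p in links[q])
        new_pr[p] = (1 - d) / N + d * incoming
    pr = new_pr

print({p: round(v, 3) for p, v in pr.items()})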
INFORMATION EXTRACTION
Information extraction is the process of acquiring knowledge by skimming a text and looking for
occurrences of a particular class of object and for relationships among objects. A typical task is to extract
instances of addresses from Web pages, with database fields for street, city, state, and zip code; or instances
of storms from weather reports, with fields for temperature, wind speed, and precipitation. In a limited domain,
this can be done with high accuracy.
1. Finite-state automata for information extraction
The simplest type of information extraction system is an attribute-based extraction system that assumes
that the entire text refers to a single object and the task is to extract attributes of that object.
For example, the problem of extracting from the text “IBM ThinkBook 970. Our price: $399.00” the
set of attributes {Manufacturer=IBM, Model=ThinkBook970, Price=$399.00}. We can address this problem
by defining a template (also known as a pattern) for each attribute we would like to extract. The template is
defined by a finite state automaton, the simplest example of which is the regular expression, or regex.
Here we show how to build up a regular expression template for prices
[0-9] matches any digit from 0 to 9
[0-9]+ matches one or more digits
[.][0-9][0-9] matches a period followed by two digits
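A small sketch combining these pieces into a price template in Python's re module, applied to the ThinkBook example above; the attribute name is an illustrative choice.

# Sketch of an attribute-extraction template for prices.
import re

# "$" followed by one or more digits, optionally a period and two digits
PRICE = re.compile(r"\$[0-9]+(\.[0-9][0-9])?")

text = "IBM ThinkBook 970. Our price: $399.00"
match = PRICE.search(text)
if match:
    print({"Price": match.group()})     # {'Price': '$399.00'}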
One step up from attribute-based extraction systems are relational extraction systems, which deal with multiple objects and the relations among them. Thus, when these systems see the text “$249.99,” they need to determine not just that it is a price, but also which object has that price. A relational extraction system can be built as a series of cascaded finite-state transducers. That is, the system consists of a series of small, efficient finite-state automata (FSAs).
FASTUS consists of five stages:
1. Tokenization
2. Complex-word handling
3. Basic-group handling
4. Complex-phrase handling
5. Structure merging
FASTUS’s first stage is tokenization, which segments the stream of characters into tokens (words, numbers,
and punctuation). The second stage handles complex words, including collocations such as “set up” and “joint
venture”. For example, a company name might be recognized by the rule
CapitalizedWord+ (“Company” | “Co” | “Inc” | “Ltd”)
The third stage handles basic groups, meaning noun groups and verb groups. The idea is to chunk these into
units that will be managed by the later stages.
Example: Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern
and a Japanese trading house to produce golf clubs to be shipped to Japan.
The fourth stage combines the basic groups into complex phrases. Again, the aim is to have rules that
are finite-state and thus can be processed quickly, and that result in unambiguous output phrases. One type of
combination rule deals with domain-specific events. For example, the rule
Company+ SetUp JointVenture (“with” Company+)?
captures one way to describe the formation of a joint venture. The final stage merges structures that were built
up in the previous step.
In general, finite-state template-based information extraction works well for a restricted domain. Finite-state
information extraction is less successful at recovering information in highly variable format, such as text
written by humans on a variety of subjects.
2. Probabilistic models for information extraction
The simplest probabilistic model for sequences with hidden state is the hidden Markov model, or HMM.
For example, here is a brief text and the most probable (Viterbi) path for that text for two HMMs, one trained
to recognize the speaker in a talk announcement, and one trained to recognize dates. The “-” indicates a
background state:
HMMs have two big advantages over FSAs for extraction. First, HMMs are probabilistic, and thus tolerant
to noise. In a regular expression, if a single expected character is missing, the regex fails to match; with HMMs
there is graceful degradation with missing characters/words, and we get a probability indicating the degree of
match, not just a Boolean match/fail. Second, HMMs can be trained from data; they don’t require laborious
engineering of templates, and thus they can more easily be kept up to date as text changes over time.
3. Conditional random fields for information extraction
An HMM is a generative model; it models the full joint probability of observations and hidden states, and
thus can be used to generate samples. That is, we can use the HMM model not only to parse a text and recover
the speaker and date, but also to generate a random instance of a text containing a speaker and a date.
We can define a simple feature function, for example one that produces a value of 1 if the current word is
ANDREW and the current state is SPEAKER:
Parameter values can be set manually or can be learned from data. Now consider a second feature function:
This feature is true if the current state is SPEAKER and the next word is “said”. Features can also be
defined over transitions between states. The features we defined here were binary, but in general, a feature
function can be any real-valued function.
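As a rough illustration, the sketch below writes the two feature functions described above as Python functions over (previous state, current state, word sequence, position); this argument convention is an assumption, not taken from the notes.

# Illustrative CRF feature functions, following the description above.
def f1(prev_state, state, words, i):
    # 1 if the current word is "Andrew" and the current state is SPEAKER
    return 1 if state == "SPEAKER" and words[i].lower() == "andrew" else 0

def f2(prev_state, state, words, i):
    # 1 if the current state is SPEAKER and the next word is "said"
    return 1 if (state == "SPEAKER" and i + 1 < len(words)
                 and words[i + 1].lower() == "said") else 0

words = ["Andrew", "said", "the", "talk", "is", "at", "4", "pm"]
print(f1(None, "SPEAKER", words, 0))   # 1
print(f2(None, "SPEAKER", words, 0))   # 1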
4. Ontology extraction from large corpora
A different application of extraction technology is building a large knowledge base or ontology of facts
from a corpus. Here is one of the most productive templates:
NP such as NP (, NP)* (,)? ((and | or) NP)? .
Here the bold words and commas must appear literally in the text, but the parentheses are for grouping,
the asterisk means repetition of zero or more, and the question mark means optional. NP is a variable standing
for a noun phrase;
This template matches the texts “diseases such as rabies affect your dog” and “supports network
protocols such as DNS,” concluding that rabies is a disease and DNS is a network protocol. Similar templates
can be constructed with the key words “including,” “especially,” and “or other.” Of course these templates
will fail to match many relevant passages, like “Rabies is a disease.” That is intentional. The “NP is a NP”
template does indeed sometimes denote a subcategory relation, but it often means something else, as in “There
is a God” or “She is a little tired.”
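As a rough sketch, the code below approximates the "such as" template with a regular expression, treating each NP as a single word; a real system would use a noun-phrase chunker. The sample sentences are the ones quoted above; everything else is an illustrative assumption.

# Simplified "NP such as NP (, NP)* ((and|or) NP)?" template,
# with each NP approximated by a single word.
import re

PATTERN = re.compile(
    r"(\w+) such as (\w+)((?:, \w+)*)(?:,? (?:and|or) (\w+))?"
)

texts = [
    "diseases such as rabies affect your dog",
    "supports network protocols such as DNS",
]

for text in texts:
    m = PATTERN.search(text)
    if m:
        category, first = m.group(1), m.group(2)
        print(first, "is a kind of", category)
# rabies is a kind of diseases
# DNS is a kind of protocols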
5. Automated template construction
It is possible to learn templates from a few examples, then use the templates to learn more examples, from
which more templates can be learned, and so on. In one of the first experiments of this kind, Brin (1999)
started with a data set of just five examples:
Clearly these are examples of the author–title relation, but the learning system had no knowledge of
authors or titles. The words in these examples were used in a search over a Web corpus, resulting in 199
matches. Each match is defined as a tuple of seven strings,
(Author, Title, Order, Prefix, Middle, Postfix, URL),
where Order is true if the author came first and false if the title came first, Middle is the characters between the author and title, Prefix is the 10 characters before the match, Postfix is the 10 characters after the match, and URL is the Web address where the match was made.
In the experiment run by Brin, the first 199 matches generated three templates. The most productive template
was
<LI><B>Title</B> by author {URL: www.sff.net/locus/c}
The examples were then used to generate more templates, and so on, eventually yielding over 15,000 titles.
Given a good set of templates, the system can collect a good set of examples. Given a good set of examples,
the system can build a good set of templates.
The biggest weakness in this approach is the sensitivity to noise. If one of the first few templates is
incorrect, errors can propagate quickly. One way to limit this problem is to not accept a new example unless
it is verified by multiple templates.
6. Machine reading
Machine reading system is an extraction system with no human input of any kind and it could read on its own
and build up its own database. Such a system would be relation-independent; would work for any relation. A
representative machine-reading system is TEXTRUNNER and it uses co-training to boost its performance, but
it needs something to bootstrap from. For TEXTRUNNER, the original inspiration was a taxonomy of eight
very general syntactic templates, as shown in Figure
Given a set of labeled examples of this type, TEXTRUNNER trains a linear-chain CRF to extract further
examples from unlabeled text. TEXTRUNNER achieves a precision of 88% and recall of 45% on a large Web
corpus. TEXTRUNNER has extracted hundreds of millions of facts from a corpus of a half-billion Web pages.
MACHINE TRANSLATION
Machine translation is a sub-field of computational linguistics that investigates the use of software to translate
text or speech from one language to another.
Applications
o Language interpretation
o Website translation
o Document translation
o Speech translation
o Mobile translation.
Issues: Due to structural variation of different languages, automatic translation is a difficult and challenging
task.
Word order: arrangement of words varies across languages
Deepika reads a book
Subject Verb Object
தீபிகா புத்தகம் படித்தாள்
Subject Object Verb
Word Sense: sense of a word in one language may be different from other language
Book a room
In Tamil, the noun "book" means “புத்தகம்”, so translating "book a room" word by word leads to an incorrect translation.
Reference Resolution: Resolving references is important for machine translation. Unresolved
references may lead to incorrect translation.
Sudha forgot her pen drive in the lab. Here, "her" refers to Sudha.
Idioms: The old man finally kicked the bucket.
The actual meaning is "The old man finally died". If we translate it literally, word by word, it leads to an incorrect machine translation.
Ambiguity: the same word may have different meanings.
1. Direct Machine Translation
The source language (SL) text is translated word by word using a bilingual SL–TL dictionary.
In the morphological analysis stage, the morphological inflections of the words are removed to obtain the root forms of the source language words.
Example: gave → give
In the bilingual dictionary look-up stage, the target language words corresponding to the source language words are retrieved from a bilingual dictionary.
Example: food (English) → "உணவு" (Tamil)
In the syntactic rearrangement stage, the word order is changed to best match the word order of the target language.
Example:
The boy plucked the flower. (In English)
Subject Verb Object
Advantage
Easy and fast.
Disadvantage
It does not attempt to disambiguate words, therefore the quality of the output is not very satisfactory.
Word-by-word translation may make it difficult to get the correct meaning.
It does not consider the structure and relationships between words.
It is developed for a specific language pair and cannot be adapted to a different pair of languages.
2. Transfer-based Machine Translation
It uses an SL grammar, a TL grammar, and an SL–TL dictionary and transfer grammar: the source sentence is analysed, transferred, and then generated in the target language.
Example:
The boy plucked the flower. ( In English)
Subject Verb Object
Paiyam Poovai Parithan. (In Tamil)
Subject Object Verb
Advantages
The analysis of the source language is independent of the target language
It can handle ambiguities
The same grammar rules can be used for analysis as well as generation
It is more realistic, flexible and can be extended to language pairs in a multi-lingual environment
Interlingua
SL1 Text → Analysis → Interlingua representation → Synthesis → TL1 Text
Advantages
The same analysis component may be used for more than one target language.
Disadvantages
Defining universal abstract/ interlingua representation which preserves the meaning of a sentence is
extremely difficult and is a challenging task.
3. Corpus based machine translation
Corpus based machine translation has two types
(i) Example based machine translation
(ii) Statistical based machine translation
Example-based machine translation:
SL Sentence → Retrieval of matching example patterns → Adaptation → TL Sentence
Statistical machine translation is inspired by the noisy channel model. A noisy channel introduces noise that makes it difficult to recognize the input words; the recognition system uses a model of the channel to recover the original words. The input is considered to be a distorted version of the target language sentence, and the task is to find the most likely source language sentence given the translation. We make the assumption that each phrase translation and each distortion is independent of the others, and thus we can factor the expression accordingly.
The program learns by refining a candidate description of the target concept through generalization and specialization.
Generalization changes the candidate description to let it accommodate new positive examples (for example, given background knowledge that bricks and pyramids are both polygons, a description mentioning "brick" can be generalized to "polygon").
Specialization changes the candidate description to exclude near misses.
The performance of the learning algorithm is highly sensitive to the quality and order of the training examples.
xi are the inputs,
wi are the weights,
wn is usually set as the threshold, with xn = −1 (bias),
y = Σi wi xi is the weighted sum of the inputs including the threshold (the activation level), and
o is the output. The output is computed using a function that determines how far the perceptron's activation level is below or above 0.
Training perceptrons
We can train perceptrons to compute the function of our choice
• Start with a perceptron with any values for the weights (usually 0)
• Feed the input, let the perceptron compute the answer
• If the answer is right, do nothing
• If the answer is wrong, then modify the weights by adding or subtracting the input vector (perhaps
scaled down)
• Iterate over all the input vectors, repeating as necessary, until the perceptron learns what we want
Perceptron Learning
Developed by Frank Rosenblatt using the McCulloch and Pitts model, the perceptron is the basic operational unit of artificial neural networks. It employs a supervised learning rule and is able to classify the data into two classes.
A perceptron thus has the following three basic elements:
Links − It has a set of connection links, each of which carries a weight, including a bias that always has weight 1.
Adder − It adds the inputs after they are multiplied by their respective weights.
Activation function − It limits the output of the neuron. The most basic activation function is a Heaviside step function that has two possible outputs: it returns 1 if the input is positive, and 0 for any negative input.
The perceptron is a function that maps its input "x", multiplied by the learned weight coefficients, to an output value "f(x)".
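A minimal sketch of the perceptron training rule described above (if the answer is wrong, add or subtract the scaled input vector); the data set (logical AND), learning rate and epoch count are illustrative assumptions.

# Perceptron training sketch following the rule described above.
def predict(weights, x):
    # Heaviside step on the weighted sum (last input acts as the bias, fixed at 1)
    activation = sum(w * xi for w, xi in zip(weights, x))
    return 1 if activation > 0 else 0

# Training data for logical AND; the final 1 in each input is the bias term.
data = [([0, 0, 1], 0), ([0, 1, 1], 0), ([1, 0, 1], 0), ([1, 1, 1], 1)]

weights = [0.0, 0.0, 0.0]
rate = 0.1

for _ in range(20):                        # iterate over the data repeatedly
    for x, target in data:
        error = target - predict(weights, x)
        if error != 0:                     # wrong answer: adjust the weights
            weights = [w + rate * error * xi for w, xi in zip(weights, x)]

print(weights)
print([predict(weights, x) for x, _ in data])   # [0, 0, 0, 1]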
Multilayer Network
A multilayer network has two or more layers of units, with the output from one layer serving as input
to the next. Generally, a multilayer network has three layers: an input layer, an output layer and a hidden layer. The layers with no external output connections are referred to as hidden layers. A multilayer neural network structure is given in the figure.
Feed Forward neural network
In this network, the information moves in only one direction, forward from the input nodes, through
the hidden nodes and to the output nodes. There are no cycles or loops in the network. In other words, a feed forward neural network is one that does not have any connections from output back to input. All inputs with variable weights are connected to every other node. A single layer feed forward network has one layer
of nodes, whereas a multilayer feed forward network has multiple layers of nodes. The structure of a feed
forward multilayer network is given in figure.
Step1: Build a network with the chosen number of input, hidden and output units.
Step2: Initialize all the weights to low random values.
Step3: Randomly, choose a single training pair.
Step4: Copy the input pattern to the input layer.
Step5: Cycle the network so that the activation from the inputs generates the activations in the hidden and
output layers.
Step6: Calculate the error derivative between the output activation and the final output.
Step7: Apply the method of back propagation to the summed products of the weights and errors in the
output layer in order to calculate the error in the hidden units.
Step8: Update the weights attached to each unit according to the error in that unit, the output from the unit below it and the learning parameters, until the error is sufficiently low (see the sketch below).
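A minimal sketch of this training loop for a one-hidden-layer feed-forward network, using NumPy and the XOR problem as illustrative choices; the bias handling, layer sizes, learning rate and epoch count are assumptions, not taken from the notes.

# Feed-forward network with one hidden layer, trained by back propagation.
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # inputs
T = np.array([[0], [1], [1], [0]], dtype=float)               # targets (XOR)

def add_bias(a):
    # append a constant-1 column acting as the bias input
    return np.hstack([a, np.ones((a.shape[0], 1))])

W1 = rng.normal(scale=0.5, size=(3, 4))   # (2 inputs + bias) -> 4 hidden units
W2 = rng.normal(scale=0.5, size=(5, 1))   # (4 hidden + bias) -> 1 output unit
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):
    Xb = add_bias(X)
    H = sigmoid(Xb @ W1)                  # step 5: hidden-layer activations
    Hb = add_bias(H)
    Y = sigmoid(Hb @ W2)                  # ... and output-layer activations

    delta_out = (Y - T) * Y * (1 - Y)                       # step 6: output error derivative
    delta_hidden = (delta_out @ W2[:-1].T) * H * (1 - H)    # step 7: backpropagate the error

    W2 -= lr * Hb.T @ delta_out           # step 8: update weights from errors
    W1 -= lr * Xb.T @ delta_hidden        #         and the activations below

print(np.round(Y.ravel(), 2))             # typically close to [0, 1, 1, 0]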
Boltzmann Machines
A Boltzmann machine is a variation on the idea of the Hopfield network. The main problem with Hopfield networks is that they settle into local minima. Having many local minima is good for building content-addressable memories, but for constraint satisfaction tasks we need to find the globally optimal state of the network. The Boltzmann machine is a combination of simulated annealing and the Hopfield network.
Units in a Boltzmann machine update their individual binary states by a stochastic rather than deterministic rule. The probability that any given unit will be active is given by
p = 1 / (1 + e^(−ΔE/T))
where ΔE is the sum of the unit's active input lines and T is the temperature of the network.
Reinforcement Learning
Reinforcement learning is close to human learning. The algorithm learns a policy of how to act in a given environment. Every action has some impact on the environment, and the environment provides rewards that guide the learning algorithm. Reinforcement learning deals with agents that must sense and act upon their environment.
Network learning procedure
i. The network is presented with a sample input from the training set.
ii. The network computes what it thinks should be the sample output
iii. The network is supplied with a real-valued judgment by the teacher
iv. The network adjusts its weights and the process repeats.
v. A positive value in step (iii) indicates good performance, while the negative value indicates
bad performance.
Unsupervised learning
In unsupervised learning, no labels are given to the learning algorithm, leaving it on its own to find
structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a
means towards an end. Given a set of input data, the network is allowed to play with it to try to discover
regularities and relationships between the different parts of the input. One popular unsupervised learning
scheme is known as competitive learning.
i. Present an input vector
ii. Calculate the initial activation for each output unit
iii. Let the output units fight until only one is active
iv. Increase the weights on the connections between the winning output unit and the active input units. This makes it more likely that this output unit will win the next time the pattern is repeated (see the sketch below).
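A minimal competitive-learning (winner-take-all) sketch following steps i-iv above; the data, number of output units, learning rate and iteration count are illustrative assumptions.

# Competitive learning sketch: only the winning output unit updates its weights.
import numpy as np

rng = np.random.default_rng(1)

# Two small clusters of 2-D input vectors (normalized)
data = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
data = data / np.linalg.norm(data, axis=1, keepdims=True)

weights = rng.random((2, 2))                  # one weight vector per output unit
weights /= np.linalg.norm(weights, axis=1, keepdims=True)
lr = 0.2

for _ in range(50):
    for x in data:                            # i. present an input vector
        activation = weights @ x              # ii. initial activation of each output unit
        winner = np.argmax(activation)        # iii. only the most active unit "wins"
        # iv. move the winner's weights towards the active inputs, making it
        # more likely to win the next time this pattern appears
        weights[winner] += lr * (x - weights[winner])
        weights[winner] /= np.linalg.norm(weights[winner])

print(np.round(weights, 2))   # typically each row ends up near one of the two clusters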