Daa Unit 5
Daa Unit 5
BASIC CONCEPTS
In this chapter we are concerned with the distinction between problems that can be solved by a
polynomial time algorithm and problems for which no polynomial time algorithm is known. It
is an unexplained phenomenon that for many of the problems we know and study, the best
algorithms for their solutions have computing times that cluster into two groups.The first group
consists of problems whose solution times are bounded by polynomials of small degree The
second group is made up of problems whose best-known algorithms are non polynomial In the
quest to develop efficient algorithms,no one has been able to develop a polynomial time
algorithm for any problem in the second group.
The theory of NP-complete doesn’t provide a method of obtaining polynomial time algorithms
for problems in the second group. Nor does it say that algorithms of this complexity do not
exist
Instead,what we do is show that many of the problems for which there are no known
polynomial time algorithms are computationally related.In fact, we establish two classes of
problems.These are given the names NP-hard and NP-complete.A problem that is NP-complete
has the property that it can be solved in polynomial time if and only if all other NP-complete
problems can also be solved in polynomial time. If an NP-hard problem can be solved in
polynomial time,then all NP-complete problems can be solved in polynomial time. All NP-
complete problems are NP-hard, but some NP-hard problems are not known to be NP-complete
Some of the terms related to the non-deterministic algorithm are defined below:
choice(X) : chooses any value randomly from the set X.
failure() : denotes the unsuccessful solution.
success() : Solution is successful and current thread terminates.
Example :
Problem Statement : Search an element x on A[1:n] where n>=1, on successful search return j
if a[j] is equals to x otherwise return 0.
Non-deterministic Algorithm for this problem :
j= choice(a, n)
if(A[j]==x) then
{
write(j);
success();
}
write(0); failure();
To solve this problem, it do not have to be in NP . To solve this problem, it must be both NP
and NP-hard problems.
NP-hard NP-Complete
Example: Halting problem, Vertex cover problem, Example: Determine whether a graph has
etc. a Hamiltonian cycle, Determine whether
a Boolean formula is satisfiable or not,
Circuit-satisfiability problem, etc.
STRING MATCHING:
The string-matching is a very useful tool which is used in string matching algorithm. It
examines every character in the text exactly once and reports all the valid shifts in O (n) time.
The goal of string matching is to find the location of specific text pattern within the larger body
of text (a sentence, a paragraph, a book, etc.)
Let us look at a few string matching algorithms before proceeding to their applications in real
world. String Matching Algorithms can broadly be classified into two types of algorithms –
1.Exact String Matching Algorithms
2.Approximate String Matching Algorithms
Exact String Matching Algorithms:
Exact string matching algorithms is to find one, several, or all occurrences of a defined string
(pattern) in a large string (text or sequences) such that each matching is perfect. All alphabets
of patterns must be matched to corresponding matched subsequence. These are further
classified into four categories:
Algorithms based on character comparison:
Naive Algorithm: It slides the pattern over text one by one and check for a match. If a match is
found, then slides by 1 again to check for subsequent matches.
KMP (Knuth Morris Pratt) Algorithm: The idea is whenever a mismatch is detected, we
already know some of the characters in the text of the next window. So, we take advantage of
this information to avoid matching the characters that we know will anyway match.
Boyer Moore Algorithm: This algorithm uses best heurestics of Naive and KMP algorithm and
starts matching from the last character of the pattern.
Using the Trie data structure: It is used as an efficient information retrieval data structure. It
stores the keys in form of a balanced BST.
Deterministic Finite Automaton (DFA) method:
Automaton Matcher Algorithm: It starts from the first state of the automata and the first
character of the text. At every step, it considers next character of text, and look for the next
state in the built finite automata and move to a new state.
Algorithms based on Bit (parallelism method):
Aho-Corasick Algorithm: It finds all words in O(n + m + z) time where n is the length of text
and m be the total number characters in all words and z is total number of occurrences of words
in text. This algorithm forms the basis of the original Unix command fgrep.
Hashing-string matching algorithms:
Y.Jyothi , Assistant Professor, IT Department, Dhanekula Institute of Engineering & Technology 5
R19 IT III YEAR 1 SEMESTER DESIGN AND ANALYSIS OF ALGORITHMS UNIT V
Rabin Karp Algorithm: It matches the hash value of the pattern with the hash value of current
substring of text, and if the hash values match then only it starts matching individual characters.
Approximate String Matching Algorithms:
Approximate String Matching Algorithms (also known as Fuzzy String Searching) searches for
substrings of the input string. More specifically, the approximate string matching approach is
stated as follows: Suppose that we are given two strings, text T[1…n] and pattern P[1…m].
The task is to find all the occurrences of patterns in the text whose edit distance to the pattern is
at most k. Some well known edit distances are – Levenshtein edit distance and Hamming edit
distance.
These techniques are used when the quality of the text is low, there are spelling errors in the
pattern or text, finding DNA subsequences after mutation, heterogeneous databases, etc. Some
approximate string matching algorithms are:
Naive Approach: It slides the pattern over text one by one and check for approximate matches.
If they are found, then slides by 1 again to check for subsequent approximate matches.
Sellers Algorithm (Dynamic Programming)
Shift or Algorithm (Bitmap Algorithm)
Applications of String Matching Algorithms:
Plagiarism Detection: The documents to be compared are decomposed into string tokens and
compared using string matching algorithms. Thus, these algorithms are used to detect
similarities between them and declare if the work is plagiarized or original.
Digital Forensics: String matching algorithms are used to locate specific text strings of interest
in the digital forensic text, which are useful for the investigation.
Y.Jyothi , Assistant Professor, IT Department, Dhanekula Institute of Engineering & Technology 6
R19 IT III YEAR 1 SEMESTER DESIGN AND ANALYSIS OF ALGORITHMS UNIT V
Spelling Checker: Trie is built based on a predefined set of patterns. Then, this trie is used for
string matching. The text is taken as input, and if any such pattern occurs, it is shown by
reaching the acceptance state.
Spam filters: Spam filters use string matching to discard the spam. For example, to categorize
an email as spam or not, suspected spam keywords are searched in the content of the email by
string matching algorithms. Hence, the content is classified as spam or not.
Search engines or content search in large databases: To categorize and organize data efficiently,
string matching algorithms are used. Categorization is done based on the search keywords.
Thus, string matching algorithms make it easier for one to find the information they ar
searching for.
Intrusion Detection System: The data packets containing intrusion-related keywords are found
by applying string matching algorithms. All the malicious code is stored in the database, and
every incoming data is compared with stored data. If a match is found, then the alarm is
generated. It based on exact string matching algorithms where each intruded packet must be
detected.
Complexity:
The running time of RABIN-KARP-MATCHER in the worst case scenario O ((n-m+1) m but it
has a good average case running time. If the expected number of strong shifts is small O (1) and
Y.Jyothi , Assistant Professor, IT Department, Dhanekula Institute of Engineering & Technology 11
R19 IT III YEAR 1 SEMESTER DESIGN AND ANALYSIS OF ALGORITHMS UNIT V
prime q is chosen to be quite large, then the Rabin-Karp algorithm can be expected to run in
time O (n+m) plus the time to require to process spurious hits.
Drawback:Just by applying the hash function based on the table created , then sometimes even
though Hash Code matched of both pattern and Text but the characters of the Pattern dont
matches to the Text.
Solution:
Initially: m = length [p] = 7
Π [1] = 0
k=0
The for loop beginning in step 5 runs 'n' times, i.e., as long as the length of the string 'S.'
Since step 1 to step 4 take constant times, the running time is dominated by this for the
loop. Thus running time of the matching function is O (n).
Let us execute the KMP Algorithm to find whether 'P' occurs in 'T.'
For 'p' the prefix function, ? was computed previously and is as follows:
Solution:
Y.Jyothi , Assistant Professor, IT Department, Dhanekula Institute of Engineering & Technology 14
R19 IT III YEAR 1 SEMESTER DESIGN AND ANALYSIS OF ALGORITHMS UNIT V
Initially: n = size of T = 15
m = size of P = 7
Pattern 'P' has been found to complexity occur in a string 'T.' The total number of shifts that
took place for the match to be found is i-m = 13 - 7 = 6 shifts.
SUFFIX TREES
Suffix tree is a compressed trie of all the suffixes of a given string. Suffix trees help in solving
a lot of string related problems like pattern matching, finding distinct substrings in a given
string, finding longest palindrome etc
Before going to suffix tree, let's first try to understand what a compressed trie is.
Consider the following set of strings: {"banana","nabd","bcdef","bcfeg","aaaaaa","aaabaa" }
A standard trie for the above set of strings will look like:
As it might be clear from the images show above, in a compressed trie, edges that direct to a
node having single child are combined together to form a single edge and their edge labels are
concatenated. So this means that each internal node in a compressed trie has atleast two
children. Also it has atmost N leaves, where N is the number of strings inserted in the
compressed trie. Now both the facts: Each internal node having atleast two children, and that
there are N leaves, implies that there are atmost 2N−1 nodes in the trie. So the space
complexity of a compressed trie is O(N) as compared to the O(N2) of a normal trie.
So that is one reason why to use compressed tries over normal tries.
Before going to construction of suffix trees, there is one more thing that should be understood,
Implicit Suffix Tree. In Implicit suffix trees, there are atmost N leaves, while in normal one
there should be exactly N leaves. The reason for atmost N leaves is one suffix being prefix of
another suffix. Following example will make it clear. Consider the string "banana"
Implicit Suffix Tree for the above string is shown in image below:
To avoid getting an Implicit Suffix Tree we append a special character that is not equal to any
other character of the string. Suppose we append $ to the given string then, so the new string
is "banana$". Now its suffix tree will be
Y.Jyothi , Assistant Professor, IT Department, Dhanekula Institute of Engineering & Technology 21
R19 IT III YEAR 1 SEMESTER DESIGN AND ANALYSIS OF ALGORITHMS UNIT V