[go: up one dir, main page]

0% found this document useful (0 votes)
13 views20 pages

Algorithms - String Processing - 04 - Aho-Corasick

Uploaded by

anshuman singh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
13 views20 pages

Algorithms - String Processing - 04 - Aho-Corasick

Uploaded by

anshuman singh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 20

Aho–Corasick

Algorithm

Mostafa Saad Ibrahim


Teaching Assistant @ Cairo University
Software Engineer @ NTPSoftware
Pattern Matching
Problem: Given string S, set of pattern Ps, find all Ps in S
S is of size m
Total Length of Ps is n

S = ababcdefefab, Ps = {ab, bab, def, fe}


3 times ab => ababcdefefab
1 time bab => ababcdefefab
1 time def => ababcdefefab
2 times ef => ababcdefefab
Aho Prerequisites
KMP Failure Function
KMP has failure for one string, Aho has failure for N strings
Trie (Letter Tree)
KMP find one pattern, in Aho, one pattern may be related
with other patterns, we need to set them in trie
Extend Trie to handle strings relations
BFS (Breadth First Search)
Algorithm Process the strings in increasing length
Complexity: O(m + n + k), where k total # of occurrences
Remember Failure function
Given String P of length m, define an array F[m]:
Let t = P[0..i]
F[i] = length of longest proper prefix of t that is suffix of t
Same
F[i] = length of prefix of t that is longest proper suffix of t

Calculation are same

Last one better for Aho Algorithm


Remember: “Failure” of ababcaba
0 0 1 2 0 1 2 3

If failed to match after 6th position,


How to extend this linear processing Fail and try to match from 2nd position
to set of strings?

a b a b c a b a
Aho Algorithm
Similar to KMP, process patterns first, then later match
Construct Trie Given the patterns
Node will have vector: Set of the patterns such that a proper
suffix of the current string is prefix of the pattern
Let node N (string Pi) fails on node S (string Pj): As
longest proper suffix of Pi matches prefix of Pj.
Use BFS to Calc Failure of each node. All nodes in level
1 fail on root.
Construct Trie: hers, his, he, she

Image Credit: http://www.cs.uku.fi/~kilpelai/BSA05/lectures/slides04.pdf


Build Trie Failure Function

Level 1 nodes (1, 3)


Must fail on root

If we fail on node X
Add its patterns to me
Investigate by yourself
Image Credit:
http://ja.wikipedia.org/wiki/%E3%82%A8%E3%82%A4%E3%83
%9B-
%E3%82%B3%E3%83%A9%E3%82%B7%E3%83%83%E3%82%
AF%E6%B3%95
Investigate by yourself
Image Credit:
http://d.hatena.ne.jp/naoya/20090405/aho_corasick
Trie Node Normal trie should have:
Node *child[], vector<char>, Boolean

What is new?

Each node has node* fail


This allow node fail on other node

Each node has vector<int>, this for a


Patterns that has prefix matches
a proper suffix of current node string

Each node has a sequential ID


This helps in solving problems
BFS to build Failure
Why BFS?
It just allow faster implementation
Say node X fail on node Y, Which fail on node Z
Normally String of X > Y > Z
X should have longest prefixes of Y and Z + its
pattern
If we build level by level, then Z build its
patterns list
Then when Y fails on Z, it takes Z list.
Also when X fails on Y, it takes Y list, which has
Z list inside it.
BFS Initialization
Insert N patterns in the trie.
For each direct child node of root, let it fails
on root and insert in queue
BFS Initialization
In level 1, nodes 3 and 1 points to root
Queue: 3 1

Patterns start with h and s only


Any other letter will always be ignored = go to root
BFS to build Failure
Pick a node from queue, for each child
 Compute where does it fail and push it in the queue
 Child Node Failure: Let its parent fail once. Then as
long as can’t add child char fail
Trace From Level 2
In last step: Queue: {3, 1}
Pick 3, it has 1 child: 4…Process it
Cur = 3
Character to Child = h
Let k = failure of parent = 0, the root
Move from root using h => k = node 1
 Make sense: String SH, has H as suffix of it
Now Add Failure Link from child (4) to K (1)
Add child (4) to queue
Queue now has: {1, 4}
Level 2 is done
After Level 2 processing
Queue: 4 6 2
Trace From Level 3
In last step: Queue: {4, 6, 2}
Pick 4, it has 1 child: 5…Process it
Cur = 4
Character to Child = e
Let k = failure of parent = 1
Move from root using e => k = node 2
 Make sense: String SHE, has HE as suffix of it
Now Add Failure Link from child (5) to K (2)
Get List of node 2, and add it to node 5
Add child (5) to queue, Queue now has: {6, 2, 5}
And so on

Add list of 2
To list of 5
Matching Patterns in S
This is similar to KMP. We use the fail link
instead of the array.

You might also like