Aho–Corasick
Algorithm
Mostafa Saad Ibrahim
Teaching Assistant @ Cairo University
Software Engineer @ NTPSoftware
Pattern Matching
Problem: Given string S, set of pattern Ps, find all Ps in S
S is of size m
Total Length of Ps is n
S = ababcdefefab, Ps = {ab, bab, def, fe}
3 times ab => ababcdefefab
1 time bab => ababcdefefab
1 time def => ababcdefefab
2 times ef => ababcdefefab
Aho Prerequisites
KMP Failure Function
KMP has failure for one string, Aho has failure for N strings
Trie (Letter Tree)
KMP find one pattern, in Aho, one pattern may be related
with other patterns, we need to set them in trie
Extend Trie to handle strings relations
BFS (Breadth First Search)
Algorithm Process the strings in increasing length
Complexity: O(m + n + k), where k total # of occurrences
Remember Failure function
Given String P of length m, define an array F[m]:
Let t = P[0..i]
F[i] = length of longest proper prefix of t that is suffix of t
Same
F[i] = length of prefix of t that is longest proper suffix of t
Calculation are same
Last one better for Aho Algorithm
Remember: “Failure” of ababcaba
0 0 1 2 0 1 2 3
If failed to match after 6th position,
How to extend this linear processing Fail and try to match from 2nd position
to set of strings?
a b a b c a b a
Aho Algorithm
Similar to KMP, process patterns first, then later match
Construct Trie Given the patterns
Node will have vector: Set of the patterns such that a proper
suffix of the current string is prefix of the pattern
Let node N (string Pi) fails on node S (string Pj): As
longest proper suffix of Pi matches prefix of Pj.
Use BFS to Calc Failure of each node. All nodes in level
1 fail on root.
Construct Trie: hers, his, he, she
Image Credit: http://www.cs.uku.fi/~kilpelai/BSA05/lectures/slides04.pdf
Build Trie Failure Function
Level 1 nodes (1, 3)
Must fail on root
If we fail on node X
Add its patterns to me
Investigate by yourself
Image Credit:
http://ja.wikipedia.org/wiki/%E3%82%A8%E3%82%A4%E3%83
%9B-
%E3%82%B3%E3%83%A9%E3%82%B7%E3%83%83%E3%82%
AF%E6%B3%95
Investigate by yourself
Image Credit:
http://d.hatena.ne.jp/naoya/20090405/aho_corasick
Trie Node Normal trie should have:
Node *child[], vector<char>, Boolean
What is new?
Each node has node* fail
This allow node fail on other node
Each node has vector<int>, this for a
Patterns that has prefix matches
a proper suffix of current node string
Each node has a sequential ID
This helps in solving problems
BFS to build Failure
Why BFS?
It just allow faster implementation
Say node X fail on node Y, Which fail on node Z
Normally String of X > Y > Z
X should have longest prefixes of Y and Z + its
pattern
If we build level by level, then Z build its
patterns list
Then when Y fails on Z, it takes Z list.
Also when X fails on Y, it takes Y list, which has
Z list inside it.
BFS Initialization
Insert N patterns in the trie.
For each direct child node of root, let it fails
on root and insert in queue
BFS Initialization
In level 1, nodes 3 and 1 points to root
Queue: 3 1
Patterns start with h and s only
Any other letter will always be ignored = go to root
BFS to build Failure
Pick a node from queue, for each child
Compute where does it fail and push it in the queue
Child Node Failure: Let its parent fail once. Then as
long as can’t add child char fail
Trace From Level 2
In last step: Queue: {3, 1}
Pick 3, it has 1 child: 4…Process it
Cur = 3
Character to Child = h
Let k = failure of parent = 0, the root
Move from root using h => k = node 1
Make sense: String SH, has H as suffix of it
Now Add Failure Link from child (4) to K (1)
Add child (4) to queue
Queue now has: {1, 4}
Level 2 is done
After Level 2 processing
Queue: 4 6 2
Trace From Level 3
In last step: Queue: {4, 6, 2}
Pick 4, it has 1 child: 5…Process it
Cur = 4
Character to Child = e
Let k = failure of parent = 1
Move from root using e => k = node 2
Make sense: String SHE, has HE as suffix of it
Now Add Failure Link from child (5) to K (2)
Get List of node 2, and add it to node 5
Add child (5) to queue, Queue now has: {6, 2, 5}
And so on
Add list of 2
To list of 5
Matching Patterns in S
This is similar to KMP. We use the fail link
instead of the array.