Implementation of Aho-Corasick (wiki) algorithm with Double Array Trie data structure in C. Project provides search using socket and C library. The automaton can store whole Unicode alphabet. User input must be UTF8 encoded strings (they are encoded into code points internally). Implementation contains functions for storing the automaton in binary file. Project uses cmake with pkg-config. Example of usage can be found in example directory.
Aho-Corasick (AC) is text search algorithm which uses deterministic finite automaton.
It's based on idea of KMP algorithm.
It can search text in linear time O(n+m+c)
where n is the length of the strings, m is the length of the searched text and c is the count of matches.
Double Array Trie (DAT) is data structure which can store trie in two arrays. Storing the trie in DAT will consume less memory than "naive" implementation with hash tables. Use of the tail is optional. Without the tail it requires only one array to store dictionary.
Automaton is assembled from the trie, which must be assembled first. Space complexity of the trie is much greater than one of the automaton. Each node in the final automaton requires 4×4 bytes of memory and contains only information about DAT's base, check, AC's fail and output. For assembling the AC automaton, BFS and DFS algorithms are implemented. An automaton can store a maximum of 2^31-1 (signed 32bit integer) states (tree nodes), so it can fit into (2^31-1)×16 ~= 34.4 GB of memory.
Tail stores the longest suffix of string which doesn't need to be branched. Characters stored in the tail are outside the AC automaton. Searching will be asymptotically slower when using tail because of the lack of fail and output functions for AC algorithm. Using tail is optional.
Additional user data can be stored with the needle in the trie. They are laying outside the trie (automaton) and their usage is optional.
The automaton search function requires bitmask which consists of four search modes.
- FIRST = return only first occurrence and stop
- EXACT = exact match of the needle in the dictionary
- NEEDLE = construct and return found needle in the dictionary
- USER_DATA = search and return user data stored with the needle
Repository contains app (cmd directory) for communication over unix or tcp socket. Handling of socket connections is build with the libevent library (uses epool on linux and kqueue on mac). The worker pool supports handling of socket connections on multiple threads in parallel (via pthreads). Count of the worker threads is provided when assembling the worker pool.
Directory client contains socket client(s) implementation. Current implementation provides only client for GO programming language. Repository also contains CLI client (see cli directory).
Socket server has four config options listed below:
- backlog = number of maximum waiting connections (see manual)
- clientTimeout = maximum duration of socket connection (in seconds)
- serverTimeout = maximum duration of waiting for closing clients connections when shutting down server (in seconds)
- handler & handlerData = socket connection handler and his data. Default is ahoCorasickHandler with automaton, tail and user data.
As described above, search in automaton supports few search modes which affects data order over socket. Client request consists of first 4 bytes (signed 32bit int) of the length of the needle then whole needle follows. Server response of one occurrence looks like this:
- Result of search (32bit int). When no occurrence, -1 is returned. Positive number is success and size of the user data stored with the needle. Zero means that the needle was successfully found, but doesn't have any user data.
- Then user data (if any) follows.
- If searching with NEEDLE mode, size of the needle (count of UTF8 bytes) and the needle itself.
- If searching with FIRST mode no other data follows. Without this mode same message is sent for the next occurrence. The whole process is then repeated.
In lib directory cmake and pkg-config configs for C library can be found. Public headers are in include directory. Implementation uses only single library (libevent). Source code also contains implementation of dynamic doubly linked list which can perform as a stack, queue or sorted array. Supports linear and binary search and merge sort.
As mentioned above, the trie index uses signed 32bit integer. Using unsigned integer can double the number of nodes (2^32-1). For doing that, the tail must be disabled and the algorithm for searching free node in the trie modified.
Storing characters in tail requires at least n×4+4+p bytes of memory, where p is the size of pointer and n the number of characters. The tail requires at least c×n+4+p bytes of memory, where c is the size of cell, n the number of the cells and where p is the size of pointer. Using the tail as single character array can reduce the amount of used memory, but it really depends on dictionary.
When building the trie, each node keeps list of outwards transition functions (characters). These lists speeds up solving collision in array and building AC automaton significantly, but it's costing time keeping them up-to-date. Using hash tables or some another data structure can noticeably reduce complexity. Another option is not keeping information about outwards transition functions but then each must be tried, so it's usefully with smaller alphabets. Unicode 14 has 144,697 characters.
Output function points to the nearest end state where we can get through fail functions. The use of the output function speeds up searching in the automaton. It requires 4 bytes of memory per each automaton state, so not using it will reduce memory usage of the automaton by a quarter.
Yeah, code is not tested. I will buy you 🍺 for founded bug.