Research and Implementation of Exact String Matchiing Algorithms
|School||Harbin Institute of Technology|
|Course||Computer Science and Technology|
|Keywords||string matching textsearch Bit-Parallelism automaton|
String matching is an important basic research field in computer science and it occurs naturally as part of text processing, data compression, search engine, computational biology and network security. The type of pattern matching discussed in this paper is exact string matching. Topics involved:This paper first analyses the TextSearch module in Linux kernel, ameliorates its deficiencies, and extracts TextSearch from Linux kernel, and released as LibTextSearch, an independent third-party open source library. LibTextSearch runs in user space, and has more algorithms than TextSearch, provides users wider selection.We presented LNDM algorithm, which, based on BNDM, uses an overlap wide window to scan text. LNDM first scans backward for pattern prefix from middle window, then forward for suffix. Excavating maximum useful information of current window, LNDM loses less information when shifting window, and thus more efficient. Theoretical analysis shows LNDM has optimal time complexities in the worst (O(n)), best (O(n/m)) and average cases (O(n logσ(m)/m)). Experimental comparison of LNDM with existing algorithms validates our theoretical claims for searching short patterns on big alphabet.We also proposed a NBMA(Nondeterministic Boyer-Moore Automaton) algorithm, which combines BMA and Bit-Parallelism technique. BMA matches pattern with a BM automaton, whose number of states has an exponential proportion with the pattern size, and therefore seldom used. While using Bit-Parallelism to simulate a nondeterministic automaton, NBMA avoids the issue above. Also we experimented on NBMA to verify its time optimal.Finally, we presented an automated experiment platform for string matching, which, with a series of script toolkit to generate random text and patterns, provides highly flexible and configurable random string matching experiments. The Platform uses gporf to evaluate exact running time of individual algorithm, enables user to experiment on random pattern matching convieniently.