What is Stringology?

String Matching (or “Stringology”) is one of the most widely studied problems in computer science. Part of its appeal is in its direct applicability to “real world” problems. Every person who uses a text editor, and every user of a digital library or search engine, needs to find patterns in a text. The Boyer-Moore algorithm is directly implemented the search command of practically all text editors. The longest common subsequence dynamic programming algorithm is implemented in system commands that test differences between files. The largest overlap heuristic for finding the shortest common superstring has been used in DNA sequencing, and there are many other examples.

Advances in Multimedia, Digital libraries and Computational Biology have shown that a much more generalized theoretical basis of pattern matching could be of tremendous benefit. To this end, pattern matching has to adapt itself to increasingly broader definitions of “matching”. In computational biology one may be interested in finding a “close” mutation, in communications one may want to adjust for transmission noise, in texts it may be desirable to allow common typing errors. In multimedia one may want to adjust for lossy compressions, occlusions, scaling, affine transformations or dimension loss.

All above applications motivate the Pattern Matching field. In pattern matching the input is a text and pattern and a “matching” relation. The output is all locations in the text where the pattern “matches” under the given definition of match. The different applications define the matching relation. Under the given matching relation there is still a distinction between exact matching and approximate matching. In the latter case, a metric is defined on the text. A text location is considered a match if the distance between it and the pattern, under the given metric, is within the tolerated bounds. The fundamental question is what type of approximations are inherently hard computationally, and what types are faster to compute. This question motivated much of the pattern matching research in the last twenty years.