The inital material of *Forty Years of Text Indexing*, shown below,
is from the original of an invited contribution to
CPM 2013, the 24th Annual Symposium on Combinatorial Pattern Matching,
Bad Herrenalb, Germany June 17-19.
See CPM 2013, for
additional information.

All of the papers presented at CPM 2013 can be found at SpringerLink.

For information about the special session at CMP 2013, organized by Martin Farach-Colton and S. Muthukrishnan, celebrating the invention of the Suffix Tree by Peter G Weiner in 1973, see 1973+40=2013.

Six videos of the presentations at 1973+40=2013 by Martin Colton-Farach, Peter G Weiner, Vaughan R Pratt, Edward M McCreight, and Jacob Ziv, together with a concluding Panel Session, can be viewed on Vimeo at 1973+40=2013 Videos.

**Abstract.**
This paper reviews the ﬁrst 40 years in the life of textual
inverted indexes, their many incarnations, and their applications. The
paper is non-technical and assumes some familiarity with the structures
and constructions discussed. It is not meant to be exhaustive. It is meant
to be a tribute to a ubiquitous tool of string matching — the suffix tree
and its variants — and one of the most persistent subjects of study in
the theory of algorithms.

**Keywords:**
pattern matching, string searching, bi-tree, suffix tree, dawg, suffix
automaton, factor automaton, suffix array, FM-index, wavelet tree.

When William Legrand ﬁnally decrypted the string:

53‡‡†305))6*,48264‡.)4‡);806”,48†8P60))85;1‡(;:‡*8†83(88)5*†,46(;88*96

*?;8)*‡(;485);5*†2:*‡(;4956*2(5*4)8P8*;4069285);)6‡8)4‡‡;1(‡9;48081;8:8

‡1;48 85;4)485†528806*81(ddag9;48;(88;4(‡?34;48)4‡;161;:188;‡?;

it did not seem to make much more sense than it did before. The decoded message read: “A good glass in the bishop’s hostel in the devil’s seat forty-one degrees and thirteen minutes northeast and by north main branch seventh limb east side shoot from the left eye of the death’s-head a bee line from the tree through the shot ﬁfty feet out.” But at least it did sound more like natural language, and eventually guided the main character of Edgar Allan Poe’s “The Gold Bug” [56] to discover the treasure he had been after. Legrand solved a substitution cipher using symbol frequencies. He ﬁrst looked for the most frequent symbol and changed it into the most frequent letter of English, then similarly treated the second most frequent symbol, and so on.

Both before and after 1843, the natural impulse when faced with some mysterious
message has been to count frequencies of individual tokens or subassemblies
in search of a clue. Perhaps one of the most intense and fascinating subjects for
this kind of scrutiny has been bio-sequences. As soon as some such sequences
became available, statistical analyses tried to link characters or blocks of characters
to relevant biological function. With the early examples of whole genomes
emerging in the mid 90’s, it seemed natural to count the occurrences of all blocks
of size 1, 2, etc. up to any desired length, looking for statistical characterizations
of coding regions, promoter regions, etc. [67].
This review is not about cryptography. It is about a data structure and its
variants, and the many surprising and useful features it carries. Among these is
the fact that, to set up a statistical table of occurrences for
*all* substrings, of *any*
length, of a text string of *n* characters, it only takes time and space linear in the
length of the text string. While nobody would be so foolish to solve the problem
by generating all exponentially many possible substrings and then counting their
occurrences one by one, a text string may still contain
*O*(*n*^{2}) distinct substrings,
so that tabulating all of them in linear space, never mind linear time, seems
already puzzling.

Over the years, such structures have held center stage in text searching, indexing, statistics, and compression as well as in the assembly, alignment and comparison of biosequences. Their range of scope extends to areas as diverse as detecting plagiarism, ﬁnding surprising substrings in a text, testing the unique decipherability of a code, and more. Their impact on Computer Science and IT at large cannot be overstated. Text and Web searching and Bioinformatics would not be the same without them. In 2013, the Combinatorial Pattern Matching symposium celebrates the 40th anniversary of the appearance of Weiner's paper with a special session entirely dedicated to that event.

At the dawn of “stringology”, Don Knuth conjectured that the problem of finding
the longest substring common to two long text sequences of total length *n*
required *Ω*(*n* log *n*) time.
An *O*(*n* log *n*)-time had been provided by Karp, Miller
and Rosenberg [40]. That construction was destined to play a role in parallel
pattern matching [6, 24, 31, 32], but Knuth’s conjecture was short lived: in
1973, Peter Weiner showed that the problem admitted an elegant linear-time
solution [68], as long as the alphabet of the string was ﬁxed. Such a solution
was actually a byproduct of a construction he had originally set up for a different
purpose, i.e., identifying any substring of a textfile without specifying all of
them. In doing so, Weiner introduced a notion of a textual inverted index that
would elicit refinements, analyses and applications for forty years and counting,
a feature hardly shared by any other data structure.

* * * (We stop near the bottom of page 2 out of 10 in the full published paper.)

* College of Computing, Georgia Institute of Technology, 801 Atlantic Drive, Atlanta,
GA 30318, USA axa@cc.gatech.edu

** King’s College London, London WC2R 2LS, UK and Universite´ Paris-Est, France
maxime.crochemore@kcl.ac.uk

*** Department of Computer Science, Rutgers University, Piscataway, NJ 08854, USA
farach@cs.rutgers.edu

† College of Computing, Georgia Institute of Technology, 801 Atlantic Drive, Atlanta,
GA 30318, USA galil@cc.gatech.edu

‡ Department of Computer Science, Rutgers University, Piscataway, NJ 08854, USA
muthu@cs.rutgers.edu

To get access to the full published paper go to SpringerLink.