This page is an attempt to gather information about various automata-related and DAWG-related resources in one place. We want to make it the reference site. The need for it became obvious as more and more people are rediscovering the same algorithms and methods, sometimes using different terminology. Additions and corrections are welcome - send them to the maintainer.
There are several books that contain an introduction to automata theory. Some of them are listed below:
You can also find a few introductory courses on-line:
The NIST Dictionary of Algorithms and Data Structures provides many useful definitions. Wikipedia has an interesting entry on finite automata as well.
This page makes it absolutely clear that there are various algorithms for achieving the same goal, and that they differ in several respects. One of them is memory requirements. It has become quite fashionable in some circles, e.g. among computational linguists, but also (surprisingly) among some computer scientists, to dismiss any concern about memory requirements as old-fashioned. For them, the fact that one algorithm uses less memory than another is no argument in its favor. They say that may have mattered some ten years ago, but not anymore. They are clearly wrong.
Ten years ago (whenever you read this - the growth is exponential anyway), computers were about 100 times slower and had about 100 times less memory. The difference is huge. Does it mean that we now use only 1% of the memory in our computers, and the other 99% is simply wasted? Have we abandoned compression altogether? The answer is no, and the reason is the same as the reason why we have not abandoned fast algorithms in favor of slower but simpler ones. The data we use grows with our computers. We also use new kinds of data in new applications. While they may require new algorithms, some old algorithms work perfectly well on them - they just have bigger input.
Our computers not only grow; they also have... offspring. The mode of interaction changes with every generation. The first dinosaurs were fed punched cards, and the results were available from the computer center the next day. Most of us now have computers on our desks. Some of us have laptops or notebooks that can be used while traveling. There will be computers in watches. Why can't we use them for the same tasks as mainframes? Although they are subject to the same law of growth, their memories are clearly much more limited. Some programs simply do not fit into them. Unless we use memory-efficient algorithms, of course.
All the algorithms presented in this subsection take special advantage of the acyclicity of automata.
A traditional approach to constructing acyclic recognizers consists in building a trie (the algorithm for that is trivial), and then minimizing it using any of the well-known minimization algorithms. It can still be fast, but the trie can be enormous. Incremental (and to some extent also semi-incremental) algorithms presented below have much lower memory requirements.
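As an illustration, here is a minimal Python sketch of this traditional approach; the names and the representation are illustrative, not taken from any particular paper. A trie is built first, and equivalent states are then merged bottom-up (one of the well-known ways to minimize an acyclic automaton, exploiting the fact that states can be processed in reverse topological order):

```python
# Traditional approach: build a trie, then minimize it.
# Sketch only; TrieNode and the bottom-up merging are illustrative choices.

class TrieNode:
    def __init__(self):
        self.final = False
        self.edges = {}                      # symbol -> TrieNode

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for c in w:
            node = node.edges.setdefault(c, TrieNode())
        node.final = True
    return root

def minimize_acyclic(root):
    register = {}                            # signature -> canonical node
    def canon(node):
        # Children first, so that equivalent subtrees become identical objects.
        for c, t in list(node.edges.items()):
            node.edges[c] = canon(t)
        sig = (node.final,
               tuple(sorted((c, id(t)) for c, t in node.edges.items())))
        return register.setdefault(sig, node)
    return canon(root)
```

The weakness mentioned above is visible here: the full trie exists in memory before any merging takes place.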
An algorithm for incremental construction of minimal, deterministic, acyclic FSA from a sorted list of words (strings). It is very fast, and it uses the smallest amount of memory among all available methods for this task. Incremental means that words are added and integrated into a minimal automaton (or minimal except for one word) on-the-fly.
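The following Python fragment sketches the idea under simplifying assumptions (sorted, duplicate-free input; the "last child" of a state is the edge with the greatest symbol). It is meant to convey the flavor of the incremental construction, not to reproduce the published pseudocode:

```python
# Sketch of incremental construction from a sorted word list.
# Assumes words arrive in lexicographic order with no duplicates.

class State:
    __slots__ = ("final", "edges")
    def __init__(self):
        self.final = False
        self.edges = {}                    # symbol -> State

    def signature(self):
        return (self.final,
                tuple(sorted((c, id(t)) for c, t in self.edges.items())))

def build_dawg(sorted_words):
    register = {}                          # signature -> unique State
    root = State()

    def replace_or_register(state):
        # Minimize the chain of "last children" hanging off this state.
        if not state.edges:
            return
        last_sym = max(state.edges)        # last added edge (input is sorted)
        child = state.edges[last_sym]
        replace_or_register(child)
        state.edges[last_sym] = register.setdefault(child.signature(), child)

    for word in sorted_words:
        # Walk down the common prefix with the previously added word;
        # with sorted input this path is still unminimized.
        state, i = root, 0
        while i < len(word) and word[i] in state.edges:
            state = state.edges[word[i]]
            i += 1
        # Minimize the suffix of the previous word before branching off.
        replace_or_register(state)
        # Add the remaining suffix of the new word.
        for c in word[i:]:
            new = State()
            state.edges[c] = new
            state = new
        state.final = True

    replace_or_register(root)
    return root
```

The key point is that at any moment only the states on the path of the most recently added word are unminimized; everything else is already in the register, which keeps memory usage low.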
An algorithm for incremental construction of minimal, deterministic, acyclic FSA from an unsorted list of words (strings). It is very fast, though slower than the algorithm described above, and it is also economical in its use of memory; however, because the words may come in arbitrary order, the intermediate automaton and its associated structures (a register of states) can be larger.
An algorithm for semi-incremental construction of minimal, deterministic, acyclic FSA from a presorted list of words. This algorithm does not construct the minimal automaton in one step, and the intermediate automaton can be large. The strings are inverted, then sorted, and then inverted again to compress endings, but this does not give true minimization. Actually, the algorithm is used without presorting in the INTEX tool. The algorithm has the best worst-case computational complexity, but in practice it is slower than the algorithms described above, except when the strings have very long common suffixes (then it performs best).
An algorithm for semi-incremental construction of minimal, deterministic, acyclic FSA from a presorted list of words (strings). Words are sorted according to their decreasing length.
If you want to compare various construction methods for acyclic automata, you might be interested in this page.
It contains an extended version of: An algorithm for incremental construction of nondeterministic, acyclic FSA from an unsorted list of words (strings).
There are two basic methods. Both lead to nondeterministic automata that can later be determinized and minimized.
The standard construction or Thompson construction is described in all major textbooks. It has been popularized by its use in grep and other UNIX tools. It comes in several variants.
The Glushkov or McNaughton-Yamada construction produces smaller automata. It was discovered independently by Glushkov and by McNaughton and Yamada. It is less well known than the standard construction. The first three papers describe the original algorithm, the next two bring results about its complexity and its relation to Brzozowski derivatives, and the last two present an elegant inductive step-by-step construction (also called the Champarnaud-Glushkov construction). The Champarnaud-Glushkov construction is a different algorithm from the original one, but it leads to the same automaton (the Glushkov automaton), and it is easier to understand and to implement.
The first two papers below introduce the ZPC-structure, which provides an ε-free alternative to Thompson's technique. The third paper compares Thompson-based and ZPC-based techniques. The last one extends ZPC-based techniques to the case of weighted expressions.
The three following papers deal with the construction of a quotient of the Glushkov automaton, namely the Antimirov automaton. The last one shows how to convert a regular expression into its follow automaton via its ZPC-structure.
For a comparison of the Thompson and Glushkov constructions see the first paper. The second paper contains a chapter (about 50 pages) on various variants of both constructions (including determinization). The third one differs from that chapter.
For an extension of regular expressions see:
A taxonomy of minimization algorithms can be found in:
The fastest (O(|Q| log |Q|), where |Q| is the number of states) minimization algorithm (by John E. Hopcroft) is described in:
For an evaluation of various implementations of that algorithm, look at:
An algorithm that is considered by some to be equally fast on frequently encountered data is Brzozowski's algorithm. It minimizes an automaton by performing reversal and determinization twice. The subset construction (determinization) is O(2^|Q|) in the worst case, where Q is the set of states, but its cost depends on the input (see the master's thesis by Ted Leslie in the Determinization section). The second paper here describes the complexity of the operations needed to implement Brzozowski's algorithm.
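A compact sketch of Brzozowski's method is given below, assuming an NFA represented as a dictionary mapping (state, symbol) to a set of target states; the representation is an assumption made here for brevity, not taken from the papers:

```python
# Brzozowski's minimization: reverse, determinize, reverse again, determinize again.

def reverse(trans, initials, finals):
    """Reverse every transition; swap initial and final states."""
    rev = {}
    for (q, a), targets in trans.items():
        for t in targets:
            rev.setdefault((t, a), set()).add(q)
    return rev, set(finals), set(initials)

def determinize(trans, initials, finals):
    """Subset construction; DFA states are frozensets of NFA states."""
    start = frozenset(initials)
    dfa_trans, dfa_finals, todo, seen = {}, set(), [start], {start}
    while todo:
        S = todo.pop()
        if S & finals:
            dfa_finals.add(S)
        for a in {a for (q, a) in trans if q in S}:
            T = frozenset(t for q in S for t in trans.get((q, a), ()))
            dfa_trans.setdefault((S, a), set()).add(T)   # singleton target set
            if T not in seen:
                seen.add(T)
                todo.append(T)
    return dfa_trans, {start}, dfa_finals

def brzozowski(trans, initials, finals):
    d = determinize(*reverse(trans, initials, finals))
    return determinize(*reverse(*d))
```

Keeping the deterministic automaton in the same dictionary-of-sets format lets reverse and determinize compose directly, which is all Brzozowski's method needs.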
The following algorithm is different from all the others in that partial results can be used. It tests pairs of states for equivalence. The first paper describes a version with exponential time complexity. The second paper adds modifications that make it almost O(|Q|^2).
The Hopcroft-Ullman algorithm is one of the best known. Its complexity is O(|Q|^2), and it also requires O(|Q|^2) memory, so it does not seem suitable for large automata. Pairs of states are considered, and the automaton has to be complete (i.e. it must have a sink, or dead, state). A pair of states is distinguishable if one of the states is final and the other non-final. A pair of states is also distinguishable if transitions labeled with the same symbol lead to a pair of states already found distinguishable. For each pair of states, the pairs of target states reached on the same symbol are computed, and the pair is put on a list associated with each such target pair. Initially, pairs containing one final and one non-final state are marked as distinguishable. Then all pairs on the lists associated with those pairs are marked, and so on, until no new pairs are marked. Pairs that are never marked are equivalent.
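The sketch below illustrates the marking idea with a simple fixpoint loop instead of the lists of dependent pairs used in the original algorithm (so it does not achieve the O(|Q|^2) bound); the DFA representation (delta as a dictionary, complete over the alphabet) is an assumption:

```python
# Pair-marking sketch: find all pairs of distinguishable states.
from itertools import combinations

def inequivalent_pairs(states, alphabet, delta, finals):
    def key(p, q):
        return (p, q) if p <= q else (q, p)

    pairs = list(combinations(sorted(states), 2))
    # Step 1: mark pairs of one final and one non-final state.
    marked = {(p, q) for p, q in pairs if (p in finals) != (q in finals)}
    # Step 2: propagate marking until nothing changes.
    changed = True
    while changed:
        changed = False
        for p, q in pairs:
            if (p, q) in marked:
                continue
            for a in alphabet:
                tp, tq = delta[(p, a)], delta[(q, a)]
                if tp != tq and key(tp, tq) in marked:
                    marked.add((p, q))
                    changed = True
                    break
    return marked          # unmarked pairs are equivalent
```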
The Aho-Sethi-Ullman algorithm has the same time complexity as the Hopcroft-Ullman algorithm, i.e. O(|Q|^2), but it needs only O(|Q|) memory. As in the Hopcroft algorithm, the set of states is partitioned, and the blocks are refined in each step until no more divisions occur. Two states are put into different blocks if they have transitions leading to states in different blocks. In contrast to the Hopcroft algorithm, forward transitions are used for this purpose (the Hopcroft algorithm uses backward transitions to split blocks with transitions leading to the current block).
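A sketch of refinement by forward transitions (Moore-style) is given below; it illustrates the idea of splitting blocks according to the blocks of the transition targets, without the bookkeeping needed for the exact bounds stated above. The DFA representation is again an assumption:

```python
# Partition refinement using forward transitions (complete DFA assumed).

def minimize_by_refinement(states, alphabet, delta, finals):
    block = {q: int(q in finals) for q in states}        # initial partition
    while True:
        # Two states stay together only if, for every symbol, their targets
        # currently lie in the same block.
        sig = {q: (block[q],) + tuple(block[delta[(q, a)]] for a in alphabet)
               for q in states}
        ids = {}
        new_block = {q: ids.setdefault(sig[q], len(ids)) for q in states}
        if len(ids) == len(set(block.values())):
            return new_block                              # state -> class id
        block = new_block
```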
For acyclic automata, there are better methods than those given above. The minimization part of incremental and semi-incremental construction algorithms can be used.
The term deterministic can mean two things:
Determinization of the first kind is performed using subset construction. Ted Leslie investigates various approaches to the problem, as well as the influence of various properties of automata (density etc.) on the complexity of determinization. Gertjan van Noord focuses on a subproblem of determinization - removal of epsilon-transitions. Jean-Marc Champarnaud investigates determinization of specific automata (the 4th paper), and then (with coauthors) applies the results to safe regular expression search. The last paper addresses the problem of reducing the number of unreachable states produced by a brute force determinization.
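As an illustration of the epsilon-removal subproblem, here is a small generic sketch (not van Noord's method): compute epsilon-closures and fold them into the ordinary transitions. The NFA representation and the EPS marker are assumptions made for this sketch:

```python
# Generic epsilon removal for an NFA given as trans[(state, label)] -> set of states.
EPS = ""        # assumed marker for the empty label

def eps_closure(state, trans):
    """All states reachable from `state` through epsilon transitions only."""
    stack, closure = [state], {state}
    while stack:
        q = stack.pop()
        for t in trans.get((q, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def remove_epsilons(states, trans, initials, finals):
    new_trans, new_finals = {}, set()
    for q in states:
        cl = eps_closure(q, trans)
        if cl & finals:
            new_finals.add(q)
        # Pull every non-epsilon transition of the closure up to q.
        for c in cl:
            for (p, a), targets in trans.items():
                if p == c and a != EPS:
                    new_trans.setdefault((q, a), set()).update(targets)
    return new_trans, set(initials), new_finals
```

The result accepts the same language without epsilon transitions and can then be determinized with the usual subset construction.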
For transducers (the first three papers below also cover weighted transducers):
Most operations are described in:
These include weighted automata:
Fast operations on automata
Many compression methods are described in:
Another compression method, largely incompatible with those above but giving superb recognition speed (at the cost of slower performance on other tasks), is given in the first paper below. It is based on the second paper here.
There is something by Dominique Revuz, but he does not even care to list his publications.
Perfect hashing means a 1:1 mapping between a set of n unique words and the numbers 1...n (or 0...n-1). There are two basic, similar ways to implement it. In the first, each state stores the number of strings recognized by the part of the automaton reachable from that state. During word-to-number conversion, while following the path that spells out the word, the counts stored at the targets of the transitions preceding the transition being followed are added up. During number-to-word conversion, at each state the counts stored at the targets of the transitions are added to a running total as long as it does not exceed the word number; then the corresponding transition is followed. The counts can be stored in transitions instead of states, but this increases the size of the automaton.
The second method is to store in each transition the number of words recognized along all paths that precede the current transition; in other words, each transition stores the number of words whose numbers are smaller than the number of any word whose path contains that transition. This method is faster than the previous one, it can be used together with the sparse-matrix representation (which also gives very fast recognition), and it allows transition sharing; but as there are more transitions than states (even when some transitions are shared), it takes more space.
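A sketch of the first scheme (counts stored in states) is shown below, assuming an acyclic automaton of State objects with a `final` flag and an `edges` dictionary, and words numbered from 0 in lexicographic order; all names are illustrative:

```python
# Perfect hashing via per-state counts of recognized suffixes.

def count_right_language(state, memo):
    """Number of words recognized from `state` (automaton must be acyclic)."""
    if state not in memo:
        memo[state] = int(state.final) + sum(
            count_right_language(t, memo) for t in state.edges.values())
    return memo[state]

def word_to_number(root, word, counts):
    """Index of `word` (assumed to be in the dictionary), 0..n-1."""
    index, state = 0, root
    for c in word:
        if state.final:
            index += 1                        # the empty suffix comes first
        for a in sorted(state.edges):
            if a == c:
                break
            index += counts[state.edges[a]]   # skip words through earlier edges
        state = state.edges[c]
    return index

def number_to_word(root, number, counts):
    """Inverse mapping: recover the word with the given index."""
    word, state = [], root
    while True:
        if state.final:
            if number == 0:
                return "".join(word)
            number -= 1
        for a in sorted(state.edges):
            t = state.edges[a]
            if number < counts[t]:
                word.append(a)
                state = t
                break
            number -= counts[t]
        else:
            raise ValueError("number out of range")
```

The counts are computed once (counts = {}; count_right_language(root, counts)) and then reused by both conversions.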
There are two papers describing the same method. The method is based on computing the edit distance using dynamic programming.
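For reference, here is the textbook dynamic-programming formulation of edit distance that the method builds on; the papers combine it with automaton traversal, whereas this sketch only shows the distance computation between two strings:

```python
# Levenshtein edit distance by dynamic programming, row by row.

def edit_distance(a, b):
    prev = list(range(len(b) + 1))             # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))      # substitution / match
        prev = cur
    return prev[-1]
```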
A short history of two-level morphology was presented in a talk given by Lauri Karttunen and Kenneth R. Beesley at Finite State Methods in Natural Language Processing 2001, ESSLLI Workshop, August 20-24, Helsinki.
Dictionary-based morphological analysis:
Including probabilities
With recognizers
The same or similar methods are also used in INTEX and Unitex.
Approximate morphological analysis based on prefixes, infixes, and endings. The method basically consists of reversing the words (prefixes and infixes have to be treated specially), appending the result (the analysis), constructing a DFA for such strings, and pruning the automaton by removing states and transitions wherever all words sharing a certain suffix lead to the same result. In the recognition phase, words are reversed and looked up in the automaton. When there are no more transitions to follow, all results reachable from the last state reached are printed.
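A rough sketch of the recognition phase is shown below, assuming the same kind of State objects as in the earlier sketches and an assumed separator symbol between the reversed ending and its analysis (the actual encoding in the papers may differ):

```python
# Recognition phase of the approximate analysis: reverse the word, follow
# transitions as far as possible, then emit every analysis reachable from
# the last state reached.  SEP and all names are assumptions of this sketch.
SEP = ":"

def strings_from(state):
    """All strings spelled out by paths from `state` to a final state."""
    results = []
    def walk(s, acc):
        if s.final:
            results.append("".join(acc))
        for a, t in sorted(s.edges.items()):
            walk(t, acc + [a])
    walk(state, [])
    return results

def analyse(root, word):
    state = root
    for c in reversed(word):
        if c not in state.edges:
            break
        state = state.edges[c]
    # Keep only the part after the separator: the analysis itself.
    return [r.split(SEP, 1)[-1] for r in strings_from(state)]
```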
There is a forum for discussion of FSA-related issues at http://groups.yahoo.com/group/FSA-Research. To subscribe, send a message to FSA-Research-subscribe@yahoogroups.com. Since egroups was taken over by Yahoo, it is more difficult to register at the new site, and registration is required to view the list archive (but you can send and receive messages without registering). They require information about sex, age, occupation, etc. Registering as a newly born Bolivian woman (Bolivia just happened to be under the cursor) did not work, because the system requires users to be at least 2 years old. Registering as a two-year-old Bolivian woman did not work either, because the system asked me to bring my parents. Now I'm a 50-year-old Bolivian woman...
If you register at the Yahoo site, don't forget to write down your password. Until recently, you could rely on cookies to let you in. Now Yahoo has become more aggressive, and they require you to resupply your password or to give them your personal information - Big Brother is watching you. Perhaps we should move somewhere else... I'm glad this page is not on the Yahoo server.
Send new entries/corrections to the current maintainer Jan Daciuk (not a woman) at user@domain, where user=jandac, and domain=eti.pg.gda.pl. Several people contributed to this page: Gertjan van Noord, André Kempe, Mark-Jan Nederhof, Max Silberztein, Sheng Yu, Bruce Watson, Jean-Marc Champarnaud (let me know if I forgot someone). A few others promised to contribute...
The page has been moved from Groningen to http://www.eti.pg.gda.pl/~jandac/fsm_algorithms/. Due to local redirections, a different address may show in the address bar of your browser. However, if you bookmark this page, please use the address given above.
This page is free of Javascript.