Construction of minimal, deterministic, acyclic finite-state automata from sets of strings

Algorithms implemented and compared

Incremental algorithm for sorted data, described in: Jan Daciuk, Stoyan Mihov, Bruce Watson, and Richard Watson, Incremental Construction of Minimal Acyclic Finite State Automata, Computational Linguistics, 26(1), March 2000. See my publications.
Incremental algorithm for unsorted data, described in: Jan Daciuk, Stoyan Mihov, Bruce Watson, and Richard Watson, Incremental Construction of Minimal Acyclic Finite State Automata, Computational Linguistics, 26(1), March 2000. See my publications.
Semi-incremental algorithm by Dominique Revuz, described in: Dominique Revuz, Dictionnaires et lexiques: méthodes et algorithmes, PhD thesis, Institut Blaise Pascal, Paris, France, 1991. LITP 91.44.
Semi-incremental construction algorithm by Bruce Watson, described in: Bruce Watson, A fast new (semi-incremental) algorithm for the construction of minimal acyclic DFAs. In: Third Workshop on Implementing Automata, pages 01-98, Rouen, France, September 1998. Lecture notes in Computer Science, Springer.
Building a trie and minimizing it using the Hopcroft's minimization algorithm.
Building a trie and minimizing it using the minimization phase from the incremental algorithms (postorder, register-based minimization).
Building a trie and minimizing it using the minimization phase from the semi-incremental algorithm by Dominique Revuz (lexicographical sort on states of the same height).
Expanding a minimal automaton to the corresponding pseudo-minimal automaton.
Initial part of Revuz's algorithm, i.e. building the pseudo-minimal automaton from strings lexicographically sorted on their reversals.
Modification of the incremental algorithm for sorted data for building pseudo-minimal automata.
Modification of the incremental algorithm for unsorted data for building pseudo-minimal automata.
Modification of the semi-incremental Watson's algorithm.
Algorithm for building dynamic hash automata when consecutive words from input are assigned consecutive natural numbers. The automata are minimal when weights are considered parts of labels.

Results and recommendations for building minimal automata

The fastest algorithms for word lists were the incremental algorithm for sorted data and its non-incremental equivalent. For morphological dictionaries, Revuz's algorithm was faster.
Both incremental algorithms use the smallest amount of memory, much less than other algorithms, including semi-incremental ones.
Both incremental algorithms are the fastest in practice - memory use makes other algorithms go into swapping, which is painfully slow. This remark also concerns Revuz's algorithm.
It does not seem that incremental methods introduce much overhead. The non-incremental version of sorted data incremental algorithm was ususally faster than its incremental version, but the difference was not big, and the advantage could be used only for small data sets. Revuz's algorithm was faster than its non-incremental counterpart on morphological dictionaries, but slower on words.
Trie + Hopcroft minimization is the slowest algorithm. While all other methods are linear, this one has an additional logarithmic overhead, and it is quite complicated compared to register-based algorithms.
There is no real reason to use Revuz's algorithm (for morphologies, annotations can be shortened and kept elsewhere, as in e.g. INTEX), Watson's algorithm, trie + lexicographical sort (Revuz's algorithm without suffix compression), or trie + Hopcroft minimization algorithm. They are slower, and they have higher memory requirements than incremental algorithms.

Pseudo-minimal Automata

This version of adfa also construct pseudo-minimal automata. For each word in the language of the automaton, there is a transition (called proper transition that is not shared with any other word. That proper transition can be used carry word-specific information. The words on input must end with a special end-of-word marker that is not part of the alphabet of words (e.g. #). Otherwise apart from proper transitions, we would have proper states as well.

Software

The software used in the experiments is available in source form free of charge for non-commercial purposes. I want to put it under GNU Public Licence once I read it and understand it in full. adfa builds automata, and displays some statistics, but the automata are not saved. This was simply not needed for the experiments. I'm sorry if you are disappointed.

Further information

The experiments were conducted for the paper I sumbitted to the Conference on Implementation of Automata and Applications. Proceedings of that series of Conferences are printed by Springer, and the publisher asks the authors not to put their papers on their websites for one year after publication. So here is a draft.

You can find more information about finite-state automata, minimization, etc. on a page about FSA related algorithms. You can look at my two fully functional software packages implementing two of the above algorithms. The automata I use there are a bit different - they have final transitions instead of final states. This makes them a bit smaller than traditional automata. The packages are for use in the natural language processing domain. You may also want to look at my publications.

The algorithms in the adfa package operate on acyclic automata, i.e. automata without cycles. Some people find this page while looking for general minimization algorithms. They should rather look at http://www.eti.pg.gda.pl/~jandac/minim.html, a page dedicated to minimization of automata with cycles.

Jan Daciuk