Construction of minimal, deterministic, acyclic finite-state automata from sets of strings
Algorithms implemented and compared
- Incremental algorithm for sorted data, described in:
Jan Daciuk, Stoyan Mihov, Bruce Watson, and Richard Watson,
Incremental Construction of Minimal Acyclic Finite State Automata,
Computational Linguistics, 26(1), March 2000. See my publications.
- Incremental algorithm for unsorted data, described in:
Jan Daciuk, Stoyan Mihov, Bruce Watson, and Richard Watson,
Incremental Construction of Minimal Acyclic Finite State
Automata, Computational Linguistics, 26(1), March 2000. See my publications.
- Semi-incremental algorithm by Dominique Revuz, described in:
Dominique Revuz, Dictionnaires et lexiques: méthodes et
algorithmes, PhD thesis, Institut Blaise Pascal, Paris,
France, 1991. LITP 91.44.
- Semi-incremental construction algorithm by Bruce Watson,
described in: Bruce Watson, A fast new (semi-incremental)
algorithm for the construction of minimal acyclic DFAs. In:
Third Workshop on Implementing Automata, pages 01-98, Rouen,
France, September 1998. Lecture notes in Computer Science,
Springer.
- Building a trie and minimizing it using the Hopcroft's
minimization algorithm.
- Building a trie and minimizing it using the minimization phase
from the incremental algorithms (postorder, register-based
minimization).
- Building a trie and minimizing it using the minimization phase
from the semi-incremental algorithm by Dominique Revuz
(lexicographical sort on states of the same height).
- Expanding a minimal automaton to the corresponding pseudo-minimal
automaton.
- Initial part of Revuz's algorithm, i.e. building the pseudo-minimal
automaton from strings lexicographically sorted on their reversals.
- Modification of the incremental algorithm for sorted data for building
pseudo-minimal automata.
- Modification of the incremental algorithm for unsorted data for building
pseudo-minimal automata.
- Modification of the semi-incremental Watson's algorithm.
- Algorithm for building dynamic hash automata when consecutive words from
input are assigned consecutive natural numbers. The automata are minimal
when weights are considered parts of labels.
Results and recommendations for building minimal automata
- The fastest algorithms for word lists were the incremental
algorithm for sorted data and its non-incremental
equivalent. For morphological dictionaries, Revuz's algorithm
was faster.
- Both incremental algorithms use the smallest amount of memory,
much less than other algorithms, including semi-incremental
ones.
- Both incremental algorithms are the fastest in practice - memory
use makes other algorithms go into swapping, which is painfully
slow. This remark also concerns Revuz's algorithm.
- It does not seem that incremental methods introduce much
overhead. The non-incremental version of sorted data incremental
algorithm was ususally faster than its incremental version, but
the difference was not big, and the advantage could be used only
for small data sets. Revuz's algorithm was faster than its
non-incremental counterpart on morphological dictionaries, but
slower on words.
- Trie + Hopcroft minimization is the slowest algorithm. While all
other methods are linear, this one has an additional logarithmic
overhead, and it is quite complicated compared to register-based
algorithms.
- There is no real reason to use Revuz's algorithm (for
morphologies, annotations can be shortened and kept elsewhere,
as in e.g. INTEX), Watson's algorithm, trie
+ lexicographical sort (Revuz's algorithm without suffix
compression), or trie + Hopcroft minimization algorithm. They
are slower, and they have higher memory requirements than
incremental algorithms.
Pseudo-minimal Automata
This version of adfa also construct pseudo-minimal automata. For each
word in the language of the automaton, there is a transition (called proper
transition that is not shared with any other word. That proper transition
can be used carry word-specific information. The words on input must end with
a special end-of-word marker that is not part of the alphabet of words
(e.g. #). Otherwise apart from proper transitions, we would have proper states
as well.
Software
The software
used in the experiments is available in source form free of charge for
non-commercial purposes. I want to put it under GNU Public Licence once I read
it and understand it in full. adfa builds
automata, and displays some statistics, but the automata are not
saved. This was simply not needed for the experiments. I'm sorry if
you are disappointed.
Further information
The experiments were conducted for the paper I sumbitted to the
Conference on Implementation of Automata and Applications. Proceedings
of that series of Conferences are printed by Springer, and the
publisher asks the authors not to put their papers on their websites
for one year after publication. So here is a draft.
You can find more information about finite-state automata,
minimization, etc. on a page about FSA
related algorithms. You can look at my two fully functional software packages implementing two of the above
algorithms. The automata I use there are a bit different - they have
final transitions instead of final states. This makes them a bit
smaller than traditional automata. The packages are for use in the natural
language processing domain. You may also want to look at my publications.
The algorithms in the adfa package operate on acyclic automata,
i.e. automata without cycles. Some people find this page while looking
for general minimization algorithms. They should rather look at
http://www.eti.pg.gda.pl/~jandac/minim.html,
a page dedicated to minimization of automata with cycles.
Jan Daciuk