A new type of automata and a number of new algorithms are presented in this dissertation. Finite-state automata with final transitions (FSA-FT) can hold exactly the same information as classical finite-state automata. However, they can be smaller than their classical counterparts, i.e. they can have fewer states and fewer transitions. Most implementations have implicit states and store the features of states in transitions, so in fact they use FSA-FT. Unfortunately, those automata are simple transformations of minimal classical FSA, so they have the same number of states and transitions as the minimal classical automata, which is no longer minimal for FSA-FT. We use the new kind of automata explicitly from the beginning, making our automata smaller.
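To make the distinction concrete, the sketch below contrasts a classical state, which carries the acceptance flag itself, with a state in an automaton with final transitions, where the flag sits on the transition. This is only an illustration of the idea, not the data layout used in our implementation, and all type names are hypothetical.

#include <map>

// Classical automaton: acceptance is a property of states.
struct ClassicalState {
    bool final = false;
    std::map<char, ClassicalState*> edges;
};

// Automaton with final transitions (FSA-FT): acceptance is a property of
// transitions, so a dedicated final state is often unnecessary and the
// automaton can have fewer states and transitions.
struct StateFT;
struct TransitionFT {
    bool final = false;        // recognition may end on this transition
    StateFT* target = nullptr;
};
struct StateFT {
    std::map<char, TransitionFT> edges;   // states carry no finality flag
};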
We present two new algorithms for constructing deterministic acyclic
finite-state automata. Their main characteristic is that they are
incremental. The main implication of this fact is that they
require much less memory than other methods, which need to have
all the data in memory before they start. For one of the dictionaries
used in our experiments, that would mean storing about 2 GB,
which is practically impossible with current technology. The use of
DAFSA-FT makes our automata even smaller, and as the construction
algorithms require memory proportional to the size of the automaton,
their memory requirements are further reduced
in our implementation. We are aware of another algorithm that builds
deterministic acyclic FSA incrementally
([Rev91]). That algorithm (by Dominique Revuz) has
similar characteristics to our algorithms. However, Dominique Revuz
constructs automata in two steps: the first produces an automaton
after a pseudo-minimization, the second
one is the true minimization. The automaton after
the pseudo-minimization phase can be 8 times larger
([Rev91]) than the minimal one, thus increasing the
intermediate memory requirements. Our algorithms are also
more flexible than his. The algorithm for sorted data makes it
possible to perform operations on the automaton on-the-fly, because
the parts that have already been minimized are never changed again. Our algorithm for data
in arbitrary order does not require
the input to be sorted, so it can be fed directly from other programs that
produce the data, minimizing the storage space needed for intermediate results.
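As an illustration of the sorted-data variant, the sketch below adds words in lexicographic order and, using a register of equivalence classes, minimizes only the part of the previous word that lies beyond the common prefix, since that part can never change again. It is a simplified reconstruction of the general technique, not the implementation described in this dissertation, and all names are hypothetical.

#include <cstdint>
#include <map>
#include <string>

struct State {
    bool final = false;
    std::map<char, State*> edges;
};

// Key identifying the right language of a state: its finality plus the
// labels and identities of its outgoing transitions.
static std::string signature(const State* s) {
    std::string key(s->final ? "1" : "0");
    for (const auto& [label, target] : s->edges) {
        key += label;
        key += std::to_string(reinterpret_cast<std::uintptr_t>(target));
        key += ';';
    }
    return key;
}

class SortedBuilder {
public:
    // Words must arrive in lexicographic order.
    void add(const std::string& word) {
        std::size_t p = 0;                 // length of the common prefix
        while (p < word.size() && p < previous.size() && word[p] == previous[p]) ++p;
        State* s = root;
        for (std::size_t i = 0; i < p; ++i) s = s->edges[word[i]];
        replaceOrRegister(s, previous, p); // the old suffix will never change again
        for (std::size_t i = p; i < word.size(); ++i) {  // append the new suffix
            State* t = new State();
            s->edges[word[i]] = t;
            s = t;
        }
        s->final = true;
        previous = word;
    }

    State* finish() {                      // minimize the path of the last word
        replaceOrRegister(root, previous, 0);
        return root;
    }

private:
    // Walk the suffix of 'word' below depth 'i' starting at 's', replacing each
    // state by an equivalent registered state, or registering it if it is new.
    void replaceOrRegister(State* s, const std::string& word, std::size_t i) {
        if (i >= word.size()) return;
        State* child = s->edges[word[i]];
        replaceOrRegister(child, word, i + 1);
        std::string key = signature(child);
        auto it = registry.find(key);
        if (it != registry.end()) {
            s->edges[word[i]] = it->second;  // reuse the registered equivalent
            delete child;                    // the fresh copy is no longer needed
        } else {
            registry[key] = child;
        }
    }

    State* root = new State();
    std::map<std::string, State*> registry;  // one representative per equivalence class
    std::string previous;
};

Because only the states on the path of the most recently added word remain unregistered, the memory needed beyond the minimal automaton itself is bounded by the length of the longest word.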
Words not present in the lexicon pose many problems for the treatment of free texts. We present a method for constructing a guesser from the data of a morphological lexicon. Our guesser takes the form of a finite-state automaton, with the usual advantages of these machines: great speed and compact representation. The guesses are made on the basis of word endings and beginnings. The method predicts both the corresponding lexemes and the categories associated with words. Compared to traditional approaches, i.e. to using textbook rules, our method relies directly on data, so all the exceptions that are so easily overlooked in manual processing are included. Compared to Eric Brill's solution ([Bri93] and [Bri95]), we infer the rules directly from the lexicon, not from the text, so we have more accurate information as input. Another lexicon-based method for inferring guessing rules is described in [Mik97]. Andrei Mikheev uses Eric Brill's learning techniques and rule schemata, while in our approach the rules are implicit in the automaton. In fact, the guesser is an automaton that is a transformation of another automaton built from data prepared in a special way. We do not use rule schemata, so the rules we infer are more general.
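As a rough illustration of the kind of data preparation involved (the exact encoding used in our implementation may differ, and all names below are hypothetical), each lexicon entry can be turned into a string whose reversed inflected form lets the automaton be traversed from the ending of an unknown word toward candidate annotations.

#include <string>

// Build one guesser entry from a lexicon triple (inflected form, lemma, tag).
// Reversing the form turns word endings into prefixes of the entry, so an
// automaton built from such entries can be walked along the reversed unknown
// word; the annotations reachable from the deepest matching state are the guesses.
std::string guesserEntry(const std::string& form,
                         const std::string& lemma,
                         const std::string& tag) {
    std::string reversed(form.rbegin(), form.rend());
    // The annotation stores the whole lemma here for simplicity; a more compact
    // coding (e.g. the number of characters to strip plus a lemma ending) is
    // equally possible and closer to what a practical guesser would use.
    return reversed + '+' + lemma + '+' + tag;
}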
The same algorithm can be used for the acquisition of a morphological lexicon. A core lexicon based on two-level morphology should be built, containing all the rules. The guesser should carry morphological descriptions as annotations, i.e. the corresponding lexeme and the continuation classes. As at the moment of writing we do not have a two-level morphology for Polish, we have not yet been able to test these ideas.
Our enhancement of Kemal Oflazer's algorithm for spelling correction makes it possible to give proper corrections for misspelled Polish words. The problem was that there are frequent errors that exceed the edit distance threshold, yet are specific enough to be recognized and corrected. Our solution is more of a kludge; a more elegant solution could use weighted transducers.
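One way to picture the kind of adjustment involved (an illustration only, not the solution adopted in this dissertation; the error pairs listed are merely examples of well-known Polish orthographic confusions) is a predicate that lets certain characteristic rewritings count as a single error during candidate generation, regardless of how many character edits they involve.

#include <string>
#include <utility>
#include <vector>

// Example Polish confusions between alternative spellings of the same sound.
static const std::vector<std::pair<std::string, std::string>> kConfusions = {
    {"ż", "rz"}, {"u", "ó"}, {"h", "ch"},
};

// Returns true if rewriting x into y (or vice versa) should be treated as a
// single error, so that such corrections stay within the edit distance bound.
bool isSingleError(const std::string& x, const std::string& y) {
    for (const auto& [a, b] : kConfusions)
        if ((x == a && y == b) || (x == b && y == a)) return true;
    return false;
}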
We put the ideas and algorithms in context, and refer the reader to other sources where appropriate. Our solutions are compared with others, and the advantages of our methods are pointed out clearly. Our algorithms are implemented in C++, and are freely available on the Internet for non-commercial purposes at the following URL: http://www.pg.gda.pl/~jandac/fsa.html.
The theses stated in the introduction have been upheld. We proposed two algorithms for the construction of minimal, deterministic, acyclic finite-state automata that have the desired features. We have also presented a method for constructing a finite-state automaton that guesses the categories of unknown words and their corresponding canonical forms. That method is indeed faster than other methods with similar quality of guessing, and requires little or no linguistic knowledge beyond the knowledge embedded in the data.