A new type of automata and a number of new algorithms are presented in this dissertation. Finite-state automata with final transitions (FSA-FT) can hold exactly the same information as classical finite-state automata. However, they can be smaller than their classical counterparts, i.e. they can have fewer states and fewer transitions. Most implementations have implicit states and store the features of states in transitions, so in fact they use FSA-FT. Unfortunately, those automata are simple transformations of minimal classical FSA, so they have the same number of states and transitions as the minimal classical automata, which is no longer minimal for FSA-FT. We use the new kind of automata explicitly from the beginning, making our automata smaller.
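To make the distinction concrete, the sketch below contrasts a classical state, which carries the acceptance flag itself, with a state in an automaton with final transitions, where the flag sits on the transition. This is only an illustration of the idea, not the data layout used in our implementation, and all type names are hypothetical.

#include <map>

// Classical automaton: acceptance is a property of states.
struct ClassicalState {
    bool final = false;
    std::map<char, ClassicalState*> edges;
};

// Automaton with final transitions (FSA-FT): acceptance is a property of
// transitions, so a dedicated final state is often unnecessary and the
// automaton can have fewer states and transitions.
struct StateFT;
struct TransitionFT {
    bool final = false;        // recognition may end on this transition
    StateFT* target = nullptr;
};
struct StateFT {
    std::map<char, TransitionFT> edges;   // states carry no finality flag
};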
We present two new algorithms for constructing deterministic acyclic
finite-state automata. Their main characteristic is that they are
incremental. The main implication of this fact is that they
require much less memory than other methods, which need to have
all the data in memory before they start. For one of the dictionaries
used in our experiments, that would mean storing about 2 GB,
which is practically impossible with current technology. The use of
DAFSA-FT makes our automata even smaller, and as the construction
algorithms require memory proportional to the size of the automaton,
their memory requirements are further reduced
in our implementation. We are aware of another algorithm that builds
deterministic acyclic FSA incrementally
([Rev91]). That algorithm (by Dominique Revuz) has
similar characteristics to our algorithms. However, Dominique Revuz
constructs automata in two steps: the first produces an automaton
after a pseudo-minimization, the second
one is the true minimization. The automaton after
the pseudo-minimization phase can be 8 times larger
([Rev91]) than the minimal one, thus increasing the
intermediate memory requirements. Our algorithms are also
more flexible than his. The algorithm for sorted data makes it
possible to perform operations on the automaton on-the-fly, because
the parts that have already been minimized are never changed again. Our algorithm for data
in arbitrary order does not require
the input to be sorted, so it can be fed directly from other programs that
produce the data, minimizing the storage space needed for intermediate results.
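As an illustration of the sorted-data variant, the sketch below adds words in lexicographic order and, using a register of equivalence classes, minimizes only the part of the previous word that lies beyond the common prefix, since that part can never change again. It is a simplified reconstruction of the general technique, not the implementation described in this dissertation, and all names are hypothetical.

#include <cstdint>
#include <map>
#include <string>

struct State {
    bool final = false;
    std::map<char, State*> edges;
};

// Key identifying the right language of a state: its finality plus the
// labels and identities of its outgoing transitions.
static std::string signature(const State* s) {
    std::string key(s->final ? "1" : "0");
    for (const auto& [label, target] : s->edges) {
        key += label;
        key += std::to_string(reinterpret_cast<std::uintptr_t>(target));
        key += ';';
    }
    return key;
}

class SortedBuilder {
public:
    // Words must arrive in lexicographic order.
    void add(const std::string& word) {
        std::size_t p = 0;                 // length of the common prefix
        while (p < word.size() && p < previous.size() && word[p] == previous[p]) ++p;
        State* s = root;
        for (std::size_t i = 0; i < p; ++i) s = s->edges[word[i]];
        replaceOrRegister(s, previous, p); // the old suffix will never change again
        for (std::size_t i = p; i < word.size(); ++i) {  // append the new suffix
            State* t = new State();
            s->edges[word[i]] = t;
            s = t;
        }
        s->final = true;
        previous = word;
    }

    State* finish() {                      // minimize the path of the last word
        replaceOrRegister(root, previous, 0);
        return root;
    }

private:
    // Walk the suffix of 'word' below depth 'i' starting at 's', replacing each
    // state by an equivalent registered state, or registering it if it is new.
    void replaceOrRegister(State* s, const std::string& word, std::size_t i) {
        if (i >= word.size()) return;
        State* child = s->edges[word[i]];
        replaceOrRegister(child, word, i + 1);
        std::string key = signature(child);
        auto it = registry.find(key);
        if (it != registry.end()) {
            s->edges[word[i]] = it->second;  // reuse the registered equivalent
            delete child;                    // the fresh copy is no longer needed
        } else {
            registry[key] = child;
        }
    }

    State* root = new State();
    std::map<std::string, State*> registry;  // one representative per equivalence class
    std::string previous;
};

Because only the states on the path of the most recently added word remain unregistered, the memory needed beyond the minimal automaton itself is bounded by the length of the longest word.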
Words not present in the lexicon pose many problems for the treatment of free texts. We present a method for constructing a guesser from the data of a morphological lexicon. Our guesser takes the form of a finite-state automaton, with the usual advantages of these machines: great speed and compact representation. The guesses are made on the basis of word endings and beginnings. The method predicts both the corresponding lexemes and the categories associated with words. Compared to traditional approaches, i.e. to using textbook rules, our method relies directly on data, so all the exceptions that are so easily overlooked in manual processing are included. Compared to Eric Brill's solution ([Bri93] and [Bri95]), we infer the rules directly from the lexicon, not from the text, so we have more accurate information as input. Another lexicon-based method for inferring guessing rules is described in [Mik97]. Andrei Mikheev uses Eric Brill's learning techniques and rule schemata, while in our approach the rules are implicit in the automaton. In fact, the guesser is an automaton that is a transformation of another automaton built from data prepared in a special way. We do not use rule schemata, so the rules we infer are more general.
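As a rough illustration of the kind of data preparation involved (the exact encoding used in our implementation may differ, and all names below are hypothetical), each lexicon entry can be turned into a string whose reversed inflected form lets the automaton be traversed from the ending of an unknown word toward candidate annotations.

#include <string>

// Build one guesser entry from a lexicon triple (inflected form, lemma, tag).
// Reversing the form turns word endings into prefixes of the entry, so an
// automaton built from such entries can be walked along the reversed unknown
// word; the annotations reachable from the deepest matching state are the guesses.
std::string guesserEntry(const std::string& form,
                         const std::string& lemma,
                         const std::string& tag) {
    std::string reversed(form.rbegin(), form.rend());
    // The annotation stores the whole lemma here for simplicity; a more compact
    // coding (e.g. the number of characters to strip plus a lemma ending) is
    // equally possible and closer to what a practical guesser would use.
    return reversed + '+' + lemma + '+' + tag;
}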
The same algorithm can be used for the acquisition of a morphological lexicon. A core lexicon based on two-level morphology should be built, containing all the rules. The guesser should carry morphological descriptions as annotations, i.e. the corresponding lexeme and the continuation classes. As at the moment of writing we do not have a two-level morphology for Polish, we have not yet been able to test these ideas.
Our enhancement of Kemal Oflazer's algorithm for spelling correction makes it possible to give proper corrections for misspelled Polish words. The problem was that there are frequent errors that exceed the edit distance threshold, yet are specific enough to be recognized and corrected. Our solution is more of a kludge; a more elegant solution could use weighted transducers.
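One way to picture the kind of adjustment involved (an illustration only, not the solution adopted in this dissertation; the error pairs listed are merely examples of well-known Polish orthographic confusions) is a predicate that lets certain characteristic rewritings count as a single error during candidate generation, regardless of how many character edits they involve.

#include <string>
#include <utility>
#include <vector>

// Example Polish confusions between alternative spellings of the same sound.
static const std::vector<std::pair<std::string, std::string>> kConfusions = {
    {"ż", "rz"}, {"u", "ó"}, {"h", "ch"},
};

// Returns true if rewriting x into y (or vice versa) should be treated as a
// single error, so that such corrections stay within the edit distance bound.
bool isSingleError(const std::string& x, const std::string& y) {
    for (const auto& [a, b] : kConfusions)
        if ((x == a && y == b) || (x == b && y == a)) return true;
    return false;
}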
We put the ideas and algorithms in context, and refer the reader to other sources where appropriate. Our solutions are compared with others, and the advantages of our methods are pointed out clearly. Our algorithms are implemented in C++, and are freely available on the Internet for non-commercial purposes at the following URL: http://www.pg.gda.pl/~jandac/fsa.html.
The theses stated in the introduction have been upheld. We proposed two algorithms for the construction of minimal, deterministic, acyclic finite-state automata that have the desired features. We have also presented a method for constructing a finite-state automaton that guesses the categories of unknown words and their corresponding canonical forms. That method is indeed faster than other methods with similar quality of guessing, and requires little or no linguistic knowledge beyond the knowledge embedded in the data.