In the past, various handcrafted heuristics were used for the
morphological analysis of unknown words. Later, they were
supplemented by statistical techniques
(e.g. [WMS93]). However, although the probabilities
of different endings leading to their corresponding categories were
calculated, the endings themselves were chosen manually. A
revolutionary approach was proposed by Eric Brill
([Bri93], [Bri95]). The
endings, as well as prefixes, are found by the program itself. Unknown words
are first tagged by a naive initial-state annotator that assigns the
tag proper noun or common noun on the basis of their
capitalization. Then five types of transformations are applied:
Change the tag of an unknown word (from X) to Y if:
The result is compared with a tagged corpus. The best-scoring rule is chosen and applied to the corpus, so that the corpus becomes the new input data. Learning stops when no rule can increase the score.

Transformation type 4 takes into account the context of the unknown word. When the morphological analysis is separated from a contextual tagger, as is the case with our approach, the tagger must find such rules itself. Transformation types 1 and 3 check whether deleting characters from, or adding characters to, a word results in another word. Those transformations are not present in our algorithm, and neither is transformation type 5, which can be treated, if necessary, as a supplementary heuristic. The rule schemata presented above do not account for infixes, which our method can handle.
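The greedy learning loop described above can be sketched as follows. This is an illustrative reconstruction, not Brill's implementation: the corpus representation (word, current guess, gold tag) and the rule encoding as functions are assumptions made for the example.

```python
# Sketch of transformation-based learning: rules are functions mapping
# (word, current_tag) -> new_tag; the corpus is a list of
# (word, current_guess, gold_tag) triples.

def score(rule, corpus):
    """Errors the rule fixes minus errors it introduces, against the gold tags."""
    gain = 0
    for word, guess, gold in corpus:
        new = rule(word, guess)
        if new != guess:
            if new == gold:
                gain += 1          # a wrong tag was corrected
            elif guess == gold:
                gain -= 1          # a correct tag was spoiled
    return gain

def learn(corpus, candidate_rules):
    """Greedily pick the best-scoring rule; its output becomes the new
    input data. Stop when no rule can increase the score."""
    learned = []
    while True:
        best = max(candidate_rules, key=lambda r: score(r, corpus))
        if score(best, corpus) <= 0:
            break
        learned.append(best)
        corpus = [(w, best(w, g), gold) for w, g, gold in corpus]
    return learned
```

For instance, starting from a naive all-noun annotation, a rule rewriting the tag of words ending in "ing" would be selected first if it fixes more tags than it breaks.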
Andrei Mikheev ([Mik97]) applied Brill's transformations to data from a (pre-existing) morphological lexicon. He uses the following template:
where:
Compared to Brill's algorithm, Mikheev also (optionally) checks the morphological class (a set of categories) of the resulting word in transformation templates 1 and 3. In addition, his algorithm returns all categories of an unknown word, not only the most probable one.
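The key difference — returning the whole morphological class rather than a single tag — can be illustrated with a toy suffix guesser. The suffix-to-class table below is invented for the example and is not taken from [Mik97]:

```python
# Hypothetical suffix rules mapping an ending to a morphological class
# (a SET of categories), in the spirit of Mikheev's guesser.
SUFFIX_CLASSES = {
    "ness": {"NN"},
    "ing":  {"VBG", "NN"},   # e.g. "building" can be verb form or noun
    "ed":   {"VBD", "VBN"},
}

def guess_categories(word, default=frozenset({"NN", "NNP"})):
    """Return the morphological class given by the longest matching
    suffix; fall back to a default open-class set."""
    for i in range(len(word)):       # longest suffix first
        suffix = word[i:]
        if suffix in SUFFIX_CLASSES:
            return SUFFIX_CLASSES[suffix]
    return default
```

The caller receives the full ambiguity class (e.g. both VBG and NN for an "-ing" word) and can leave the final choice to a contextual tagger.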
Although Mikheev does not seem aware of this fact, the rules learnt by
Brill's algorithm can be transformed into a finite-state
automaton. However, that conversion is complex and time-consuming, and
the learning process itself takes considerable time. Our algorithm
produces the FSA directly from the data, exploiting the links that
occur naturally in the format of the data we use. In Brill's approach,
copied by Mikheev, the length of suffixes and prefixes is a constant;
increasing it requires much more computation. In our method, suffixes
are discovered naturally, so there is no need to limit their length.
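To make the last point concrete, here is a minimal sketch, not the FSA construction described in this work, of how suffixes of unbounded length emerge from the data: inserting reversed lexicon entries into a trie makes common endings share a path from the root, so no suffix length has to be fixed in advance.

```python
# Words are inserted reversed, so shared endings share trie paths.
# Each node accumulates the categories of all words passing through it.

def build_suffix_trie(lexicon):
    """lexicon: iterable of (word, category) pairs."""
    root = {"$cats": set()}
    for word, cat in lexicon:
        node = root
        for ch in reversed(word):
            node = node.setdefault(ch, {"$cats": set()})
            node["$cats"].add(cat)
    return root

def guess(root, word):
    """Follow the unknown word's ending as deep as the trie allows and
    return the categories recorded at the deepest reachable node."""
    node = root
    for ch in reversed(word):
        if ch not in node:
            break
        node = node[ch]
    return node["$cats"]
```

An unknown word such as "gawked" follows the path of "-ked" as far as the lexicon supports it and inherits the categories found there; the depth reached is determined by the data, not by a preset constant.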