In the past, various handcrafted heuristics were used for the
morphological analysis of unknown words. Later, they were
supplemented by statistical techniques
(e.g. [WMS93]). However, although the probabilities
of different endings leading to their corresponding categories were
calculated, the endings themselves were chosen manually. A
revolutionary approach was proposed by Eric Brill
([Bri93], [Bri95]). The
endings, as well as prefixes, are found by the program itself. Unknown words
are first tagged by a naive initial-state annotator that assigns the
tag proper noun or common noun on the basis of their
capitalization. Then five types of transformations are applied:
Change the tag of an unknown word (from X) to Y if:
The result is compared with a tagged corpus. The best-scoring rule is chosen and applied to the corpus, so that the corpus becomes the new input data. Learning stops when no rule can increase the score.

Transformation type 4 takes into account the context of the unknown word. When the morphological analysis is separated from a contextual tagger, as is the case with our approach, the tagger must find such rules itself. Transformation types 1 and 3 check whether deleting characters from, or adding characters to, a word results in another word. Those transformations are not present in our algorithm, and neither is transformation type 5, which can be treated, if necessary, as a supplementary heuristic. The rule schemata presented above do not account for infixes, which our method can handle.
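The greedy learning loop described above can be sketched as follows. This is an illustrative reconstruction, not Brill's implementation: the corpus representation (word, current guess, gold tag) and the rule encoding as functions are assumptions made for the example.

```python
# Sketch of transformation-based learning: rules are functions mapping
# (word, current_tag) -> new_tag; the corpus is a list of
# (word, current_guess, gold_tag) triples.

def score(rule, corpus):
    """Errors the rule fixes minus errors it introduces, against the gold tags."""
    gain = 0
    for word, guess, gold in corpus:
        new = rule(word, guess)
        if new != guess:
            if new == gold:
                gain += 1          # a wrong tag was corrected
            elif guess == gold:
                gain -= 1          # a correct tag was spoiled
    return gain

def learn(corpus, candidate_rules):
    """Greedily pick the best-scoring rule; its output becomes the new
    input data. Stop when no rule can increase the score."""
    learned = []
    while True:
        best = max(candidate_rules, key=lambda r: score(r, corpus))
        if score(best, corpus) <= 0:
            break
        learned.append(best)
        corpus = [(w, best(w, g), gold) for w, g, gold in corpus]
    return learned
```

For instance, starting from a naive all-noun annotation, a rule rewriting the tag of words ending in "ing" would be selected first if it fixes more tags than it breaks.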
Andrei Mikheev ([Mik97]) applied Brill's transformations to data from a (pre-existing) morphological lexicon. He uses the following template:
where:
Compared to Brill's algorithm, Mikheev also (optionally) checks the morphological class (a set of categories) of the resulting word in transformation templates 1 and 3. In addition, his algorithm returns all categories of an unknown word, not only the most probable one.
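The key difference — returning the whole morphological class rather than a single tag — can be illustrated with a toy suffix guesser. The suffix-to-class table below is invented for the example and is not taken from [Mik97]:

```python
# Hypothetical suffix rules mapping an ending to a morphological class
# (a SET of categories), in the spirit of Mikheev's guesser.
SUFFIX_CLASSES = {
    "ness": {"NN"},
    "ing":  {"VBG", "NN"},   # e.g. "building" can be verb form or noun
    "ed":   {"VBD", "VBN"},
}

def guess_categories(word, default=frozenset({"NN", "NNP"})):
    """Return the morphological class given by the longest matching
    suffix; fall back to a default open-class set."""
    for i in range(len(word)):       # longest suffix first
        suffix = word[i:]
        if suffix in SUFFIX_CLASSES:
            return SUFFIX_CLASSES[suffix]
    return default
```

The caller receives the full ambiguity class (e.g. both VBG and NN for an "-ing" word) and can leave the final choice to a contextual tagger.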
Although Mikheev does not seem aware of this fact, the rules learnt by
Brill's algorithm can be transformed into a finite-state
automaton. However, that conversion is complex and time-consuming, and
the learning process itself takes considerable time. Our algorithm
produces the FSA directly from the data, exploiting the links that
occur naturally in the format of the data we use. In Brill's approach,
copied by Mikheev, the length of suffixes and prefixes is a constant;
increasing it requires much more computation. In our method, suffixes
are discovered naturally, so there is no need to limit their length.
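To make the last point concrete, here is a minimal sketch, not the FSA construction described in this work, of how suffixes of unbounded length emerge from the data: inserting reversed lexicon entries into a trie makes common endings share a path from the root, so no suffix length has to be fixed in advance.

```python
# Words are inserted reversed, so shared endings share trie paths.
# Each node accumulates the categories of all words passing through it.

def build_suffix_trie(lexicon):
    """lexicon: iterable of (word, category) pairs."""
    root = {"$cats": set()}
    for word, cat in lexicon:
        node = root
        for ch in reversed(word):
            node = node.setdefault(ch, {"$cats": set()})
            node["$cats"].add(cat)
    return root

def guess(root, word):
    """Follow the unknown word's ending as deep as the trie allows and
    return the categories recorded at the deepest reachable node."""
    node = root
    for ch in reversed(word):
        if ch not in node:
            break
        node = node[ch]
    return node["$cats"]
```

An unknown word such as "gawked" follows the path of "-ked" as far as the lexicon supports it and inherits the categories found there; the depth reached is determined by the data, not by a preset constant.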