Simple automata (acceptors) can be used for storing a morphological lexicon. A single word of the language accepted by the automaton should represent morphological information for one inflected word form. We no longer have the two-level structure embedded in the transducer, therefore we need to express that structure inside the word of the language accepted by the automaton.
We divide the word into two parts: the inflected form, and annotations separated from the inflected form by an annotation separator . The annotations may contain lexemes and morphological categories . They have their own internal structure.
In order to minimize the size of the automaton, there should be an order imposed on all categories that can appear in the same entry. The categories should always be written in the same order. Categories that are present in most entries should be written last, i.e. the first category should be the most specific, and the last the most general.
Including lexemes in full could inflate the lexicon to a horrible size. This is because automata compress the beginnings and endings of strings, but are less good at compressing the inner parts of strings. The word's stem (or root) may be specific to that word. If not, it occurs in few words (i.e. few lexemes). The beginnings are compressed because there are many words that share a few letters at their starts. The endings are shared with many words. But when we put the inflected form and the lexeme one after another, we have twice the central part of the stem, with an ending (each time different) in between. The sequence of those three items: the stem, the ending, and once again the stem is unique, and it cannot be shared!
There is a solution to this problem. As the inflected form and the lexeme share the word-specific information, i.e. the stem, it needs to occur once. It is sufficient to indicate how many letters from the end of the inflected form should be deleted, and what should be added to the stem to obtain the lexeme. We use a code to indicate the number of letters that should be deleted from the end of the inflected form. ``A'' means nothing is to be deleted, ``B'' - 1 letter, ``C'' - to letters, and so on. The code is followed by the ending of the corresponding lexeme (fig. 3.3).
Figure 3.3: Format of input data for "spała" for a simple
automaton - acceptor
Inflected forms may have prefixes that are not present in lexemes. As the beginning of inflected form is different from the corresponding lexeme, whole lexemes must be stored, and so they inflate automata (see fig. 3.4). The prefixes may influence categories.
Figure 3.4: Prefixes inflate automata by enforcing storing lexemes in
full form
In order to decrease the influence of the prefixes on the size of the automaton, a method similar to that described above can be used. A code (the name prefix code will be used here) may be introduced to indicate how many letters from the beginning of the inflected form should be deleted. ``A'' means there are no prefixes, ``B'' - 1-letter prefix, ``C'' - 2-letter prefix, etc. The code is inserted in front of the code indicating how many letters to delete from the end of the word. Figure 3.5 shows the morphological data for najszybszy - the same as in fig. 3.4 - but this time with prefixes coded to save space. The first ``D'' after the annotation separator means that the first 3 (i.e. ``D'' - ``A'') letters from najszybszy should be deleted, giving szybszy. The second ``D'' says that 3 letters from the end of szybszy should be deleted, giving szyb. Finally, ki is appended to szyb giving szybki - the lexeme.
Figure 3.5: Coded prefix in data for morphological analysis with
automata - acceptors
For the Polish morphology containing 1748097 words and expressions, the automaton with coded prefixes was 15.6% smaller than that without coded prefixes. For French, the automaton with coded prefixes would be bigger than the original one, because French does not have flectional prefixes.
Word structure can be more complicated. In German, the verbs can have two parts. The first part is loosely connected to the second one, and can be separated from it by an infix, or can be put separately somewhere else in a sentence. For example, eingeladen and ladet ein are two valid forms of the verb einladen. In order to cope with the problem, a code and a special annotation need to be introduced. The code should be inserted after the prefix code when the prefix code says that there is a prefix, i.e. when it is different from ``A''. The code (the name infix code will be used here) specifies the offset from the beginning of the inflected form of the characters to be deleted. The convention is the same as for other codes, i.e. ``A'' means the letters to be deleted are right at the beginning (true prefix), ``B'' means that the letters to be deleted start at the second letter of the inflected form, and so on. Figure 3.6 shows the entry for eingeladet. The first annotation separator is followed by 3 capital letters: ``C'' - 2 characters are to be deleted, ``D'' - the characters to be deleted start after the third character of the inflected form, ``B'' - one character should be deleted form the end of the inflected form. The string ki that follows the capital letters is the suffix of lexeme.
Figure 3.6: Coded infix in morphological data
The special annotation is similar to the method used to represent prefixes in the next section. Between the inflected form and the lexeme, another annotation separator can be used to introduce the removable prefix. Figure 3.7 shows how to indicate that ladet can be a part of ladet ein (this ein can be separated form ladet by many words) .
Figure 3.7: Removable prefix coding in an automaton for morphological
analysis