or
fsa_ubuild [ options ] [ > outfile ]
fsa_ubuild does the same thing, but more slowly and using more memory. However, its input data need not be sorted, so it may come directly from another process, e.g. a morphology program (possibly with its output passed through a filter).
The output is an automaton that can be used by fsa_spell, fsa_accent, fsa_morph, fsa_guess, fsa_hash, and fsa_prefix. If you do not use the -O option, the automata produced by fsa_build and fsa_ubuild may not be identical (the order of arcs may differ), but they are isomorphic. If you do use the -O option, the size may differ as well (although I have only seen a difference of 1 arc or transition). This happens because the compression algorithm does not find the optimal solution. Although the algorithm is the same in both cases, the ordering of information in the register differs, because the hash function uses the addresses of nodes. Note that the automata are still isomorphic; only the compression rate varies.
For fsa_guess compiled without GUESS_LEXEMES, the input data should be a list of inverted words with annotations. Each line should contain an inverted word (i.e. the first character should be the last character of the word, the second one the penultimate one, and so on). The inverted word should be followed immediately by a filler character and an annotation separator, and then by grammatical annotations. The annotations specify some morphosyntactic properties of words, such as number, gender, etc.
Assuming that a file named file contains data in 3 columns (inflected word, canonical form, annotations), the following incantation:
awk '{s="";for(i=1;i<=length($1);i++)s=substr($1,i,1) s;printf "%s_+%s\n",s,$3;}' file | sort -u > file.idx
prepares data for the a tergo index. The incantation should be all on one line. For more detail, see the prep_atg.awk file included in the distribution. The standard file name extension for automata prepared in this way is atg.
For fsa_guess compiled with GUESS_LEXEMES, but without GUESS_PREFIX, each data line should contain the same information as above, but an additional annotation separator, a code, and the ending of the corresponding lexeme must be inserted in front of the first annotation separator. The code specifies how many characters must be removed from the end of the inflected word before the ending of the lexeme is appended. The code is a letter: 'A' means there are no characters to remove, 'B' means one, 'C' means 2, and so on. For more detail, see the prep_atl.awk file included in the distribution. The standard file name extension for automata prepared in this way is atl.
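The code and the lexeme ending can be derived by stripping the longest common prefix of the inflected word and its canonical form. The following is only a minimal sketch of that idea, not the prep_atl.awk script from the distribution. It assumes the same 3-column input as the a tergo incantation above, '_' as the filler character, '+' as the annotation separator, and at most 25 characters to remove. The function name to_atl and the sample data are hypothetical.

```shell
# Hypothetical helper: reads "inflected canonical annotations" lines on stdin
# and writes: inverted word, filler, separator, code, lexeme ending,
# separator, annotations.
to_atl() {
  awk '{
    # p = length of the longest common prefix of inflected and canonical form
    p = 0
    while (p < length($1) && p < length($2) &&
           substr($1, p + 1, 1) == substr($2, p + 1, 1)) p++
    # code letter: A = remove 0 characters, B = 1, C = 2, ...
    code = substr("ABCDEFGHIJKLMNOPQRSTUVWXYZ", length($1) - p + 1, 1)
    # invert the inflected word
    s = ""
    for (i = 1; i <= length($1); i++) s = substr($1, i, 1) s
    printf "%s_+%s%s+%s\n", s, code, substr($2, p + 1), $3
  }'
}

# "cried" -> remove 3 characters ("ied"), append "y" -> "cry"
printf 'cried cry Vpp\n' | to_atl    # prints: deirc_+Dy+Vpp
```

Here 'D' says that three characters are removed from the end of "cried", and "y" is the ending appended to obtain "cry".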
For fsa_guess compiled with both GUESS_LEXEMES and GUESS_PREFIX, data lines are similar to those specified above. For inflected forms that do not contain flectional prefixes, an additional annotation separator is added after the first one (see the prep_atp.awk file included in the distribution). For inflected forms that do contain flectional prefixes, the prefix is removed from the inverted word, leaving the filler character, and is placed between two annotation separators in simple, non-inverted form. The prep_atp.awk file does not contain code for recognizing prefixes; it should be modified for each language so that it recognizes specific morphological categories. Only prefixes that differentiate between forms that have the same suffix should be recognized. The standard file name extension for automata prepared in this way is atp.
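The resulting line layout can be illustrated with a small sketch. It is not the prep_atp.awk script from the distribution and it does not recognize prefixes itself: the prefix, if any, is simply supplied in a hypothetical fourth input column. It assumes '_' as the filler character and '+' as the annotation separator. The text above does not say how the code is computed when a prefix is present; this sketch assumes it is computed on the inflected word with the prefix already stripped. The function name to_atp and the sample data are hypothetical as well.

```shell
# Hypothetical helper: reads "inflected canonical annotations [prefix]" lines
# and writes: inverted word without prefix, filler, separator, prefix (may be
# empty), separator, code, lexeme ending, separator, annotations.
to_atp() {
  awk '{
    prefix = (NF >= 4) ? $4 : ""
    # ASSUMPTION: the inflected word starts with the given prefix
    word = substr($1, length(prefix) + 1)
    # p = length of the longest common prefix of stripped word and canonical form
    p = 0
    while (p < length(word) && p < length($2) &&
           substr(word, p + 1, 1) == substr($2, p + 1, 1)) p++
    # code letter: A = remove 0 characters, B = 1, C = 2, ...
    code = substr("ABCDEFGHIJKLMNOPQRSTUVWXYZ", length(word) - p + 1, 1)
    # invert the word without its prefix
    s = ""
    for (i = 1; i <= length(word); i++) s = substr(word, i, 1) s
    printf "%s_+%s+%s%s+%s\n", s, prefix, code, substr($2, p + 1), $3
  }'
}

printf 'cried cry Vpp\n' | to_atp              # prints: deirc_++Dy+Vpp
printf 'gemacht machen Vpp ge\n' | to_atp      # prints: thcam_+ge+Ben+Vpp
```

The first line has no prefix, so the two annotation separators are adjacent; in the second, the prefix "ge" sits between them in non-inverted form.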