- -O
- makes the resulting automaton smaller, at the cost of
a much longer build time. How much longer depends on
the compile options used when fsa_build was compiled.
See the Makefile and INSTALL files from the distribution
for an explanation of the various compile options. The
default options compress the automaton the most. This
option cannot be used with the -N option.
- -i input_file
- specifies the input file. That file should contain a
list of words, one word per line. In the absence of the
-i option, standard input is used instead.
- -o output_file
- specifies the output file, i.e. where the automaton
should be placed. In the absence of the -o option,
standard output is used instead.
- -A annotation_separator
- specifies a character that separates words from
morphological annotations.
- -X
- prepares an index a tergo that is used to predict
word categories. This option is available only if
the program was compiled with the A_TERGO compile
option. Specifying the PRUNE_ARCS compile option helps
make the resulting automaton smaller and faster.
These compile options are on by default. The format
of the input data depends on the compile options used
to build the fsa_guess program, and affects the outcome
of that program.
For fsa_guess compiled without GUESS_LEXEMES, the
input data should be a list of inverted words with
annotations. Each line should contain an inverted
word (i.e. the first character should be the last
character of the word, the second one the
penultimate one, and so on). This inverted word
should be followed immediately by a filler character
and an annotation separator, and then by grammatical
annotations. The annotations specify some morphological
properties of words, such as number, gender,
etc.
Assuming that a file named file contains data in 3
columns (inflected word, canonical form, annotations),
the following incantation:
awk '{s=""; for(i=1;i<=length($1);i++) s = substr($1,i,1) s; printf "%s_+%s\n",s,$3;}' file | sort -u > file.idx
prepares data for the a tergo index, using _ as the
filler character and + as the annotation separator.
The incantation should be entered on one line. For
more detail see the contents of the prep_atg.awk file
included in the distribution. The standard name
extension for automata prepared in this way is atg.
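The line format described above can also be sketched in Python. This is only an illustration of the data layout, not a replacement for prep_atg.awk. The function names atg_line and atg_lines are hypothetical, and the filler character _ and annotation separator + are assumptions taken from the awk incantation.

```python
FILLER = "_"      # assumed filler character, as in the awk incantation
SEPARATOR = "+"   # assumed annotation separator (the value given to -A)


def atg_line(inflected: str, annotations: str) -> str:
    """Return one index line: inverted word, filler, separator, annotations."""
    return inflected[::-1] + FILLER + SEPARATOR + annotations


def atg_lines(rows):
    """Build sorted, de-duplicated index lines from whitespace-separated
    3-column rows (inflected word, canonical form, annotations)."""
    lines = set()
    for row in rows:
        fields = row.split()
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        inflected, _canonical, annotations = fields[:3]
        lines.add(atg_line(inflected, annotations))
    # Note: sorted() orders by code point, which may differ from a
    # locale-aware sort -u.
    return sorted(lines)


if __name__ == "__main__":
    sample = ["cats cat NNS", "cats cat NNS", "dog dog NN"]
    for line in atg_lines(sample):
        print(line)
```

The sample rows are made up for illustration; real annotation tags depend on the dictionary being used.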
For fsa_guess compiled with GUESS_LEXEMES, but
without GUESS_PREFIX, each data line should contain
the same information as above, but an additional
annotation separator, a code, and the ending of the
corresponding lexeme must be inserted in front of
the first annotation separator. The code specifies
how many characters must be removed from the end of
the inflected word before appending the ending of
the lexeme. The code is a letter: `A' means there
are no characters to remove, `B' means one, `C'
means 2, and so on. For more detail see the
prep_atl.awk file included in the distribution. The
standard name extension for automata prepared in
this way is atl.
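As a rough sketch of the code letter and lexeme ending described above, the following Python derives both from the longest common prefix of the inflected word and its canonical form. Using the common prefix is an assumption about how the values are obtained; prep_atl.awk is the authoritative reference. The names atl_line and lexeme_from are hypothetical, as are the _ filler and + separator.

```python
FILLER = "_"      # assumed filler character
SEPARATOR = "+"   # assumed annotation separator


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def atl_line(inflected: str, canonical: str, annotations: str) -> str:
    """Inverted word, filler, separator, code + lexeme ending,
    separator, annotations."""
    keep = common_prefix_len(inflected, canonical)
    to_remove = len(inflected) - keep      # characters to cut from the end
    code = chr(ord("A") + to_remove)       # 'A' = 0, 'B' = 1, 'C' = 2, ...
    ending = canonical[keep:]              # what to append after cutting
    return (inflected[::-1] + FILLER + SEPARATOR + code + ending
            + SEPARATOR + annotations)


def lexeme_from(inflected: str, code: str, ending: str) -> str:
    """Reverse step: apply a code and an ending to an inflected word."""
    k = ord(code) - ord("A")
    return inflected[:len(inflected) - k] + ending


if __name__ == "__main__":
    print(atl_line("cried", "cry", "V"))
    print(lexeme_from("cried", "D", "y"))
```

For "cried" with canonical form "cry", three characters are cut and "y" is appended, so the code is D. The sketch does not handle words that would need a code beyond the letters of the alphabet.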
For fsa_guess compiled with both GUESS_LEXEMES and
GUESS_PREFIX, data lines are similar to those specified
above. For inflected forms that do not contain
inflectional prefixes, an additional annotation
separator is added after the first one (see the
prep_atp.awk file included in the distribution).
For inflected forms that do contain inflectional
prefixes, the prefix is removed from the inverted word,
leaving the filler character, and is placed between
two annotation separators in plain, non-inverted
form. The prep_atp.awk file does not contain
code for recognizing prefixes; it should be
modified for individual languages to recognize
specific morphological categories. Only prefixes
that differentiate between forms that have the same
suffix should be recognized. The standard name
extension for automata prepared in this way is atp.
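The layout of a line with a prefix field might be assembled as in the Python sketch below. It is one reading of the description above, so prep_atp.awk should be checked for the exact format. The function atp_line is hypothetical, the prefix must be supplied by the caller (no prefix recognition is attempted), and the code and ending are taken as given, because the text does not say whether the code is counted on the word with or without its prefix.

```python
FILLER = "_"      # assumed filler character
SEPARATOR = "+"   # assumed annotation separator


def atp_line(inflected: str, prefix: str, code: str, ending: str,
             annotations: str) -> str:
    """Inverted word (prefix removed), filler, separator, prefix in plain
    form, separator, code + lexeme ending, separator, annotations.

    With an empty prefix the two separators end up adjacent, which is the
    'additional annotation separator' case for forms without a prefix.
    """
    if not inflected.startswith(prefix):
        raise ValueError("inflected form does not start with the prefix")
    rest = inflected[len(prefix):]
    return (rest[::-1] + FILLER + SEPARATOR + prefix + SEPARATOR
            + code + ending + SEPARATOR + annotations)


if __name__ == "__main__":
    # Hypothetical example data: German "gemacht" with prefix "ge".
    print(atp_line("gemacht", "ge", "B", "en", "V"))
    # A form without an inflectional prefix.
    print(atp_line("cats", "", "B", "", "N"))
```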
- -N
- numbers entries. All entries are numbered according
to their position (line number) in the input
stream. This is so-called perfect hashing. This
option works only if the program was compiled with
the NUMBERS compile option. This option (i.e. -N)
cannot be used with the -O option.
- -v
- prints version details and the compile options used.
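To illustrate what the numbering done by -N provides, the Python sketch below maps each word to its position in the input and back. It shows only the concept of perfect hashing (a collision-free, two-way mapping between words and numbers) using a plain dictionary; it says nothing about how the automaton stores the numbers. The names number_entries and words_by_number are hypothetical, and counting from 1 is an assumption, since the text does not state the number of the first entry.

```python
def number_entries(words, first=1):
    """Map each word to its position in the input stream.

    first is the number given to the first entry; 1 is an assumption.
    The words are assumed to be unique.
    """
    return {word: n for n, word in enumerate(words, start=first)}


def words_by_number(table):
    """Invert the mapping: number -> word."""
    return {n: word for word, n in table.items()}


if __name__ == "__main__":
    words = ["apple", "banana", "cherry"]
    table = number_entries(words)
    print(table["banana"])
    print(words_by_number(table)[3])
```

Because every word gets a distinct number and every number in range maps back to exactly one word, the numbers can serve as indices into external tables of per-word data.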
Jan Daciuk
Wed Jun 3 14:37:17 CEST 1998
Software at http://www.pg.gda.pl/~jandac/fsa.html