Po polsku
This page contains no Javascript. No need to analyze the source!
Important note
Due to problems with GUT WWW servers, and due to GUT's planned switch
to another internet domain, this page has been moved to a new
location: http://www.jandaciuk.pl/fsa.html. Note
that the software repository on an FTP server had to be moved as well,
as not only it will be affected by the domain switch (the current
domain will be abandoned!), but GUT's FTP server crew is unreliable
(they switched off the server for several month at the beginning of 2016).
Finite state utilities
This page describes two software packages and some accompanying files
available from
http://www.jandaciuk.pl/Software/Fsa/
and http://www.jandaciuk.pl/Software/Utr/. Farther on this
page, you will find direct links to the current versions of both
software packages.
Further down on this page:
What are they for?
- Spellchecking
- Diacritic restoration
- Morphological Analysis and Synthesis
- Acquisition of data for morphological descriptions
- Perfect Hashing
Available Packages
Both packages are written in C++, and they can be compiled with
g++. Problems may arise when compiling with different compilers due to
the use of templates that tend not to be implemented universally. Both
packages use compact dictionaries that are automata (of different forms)
An interface in elisp is provided for both of them. The interface worked
with emacs19, but it is highly unlikely that it will work with emacs20
or later (specially with mule).
fsa - finite state automata
Current version number: 0.51. Man pages (except for fsa_synth)
in HTML format are also available as
one
file.
- fsa_accent - restores
diacritics. Useful for languages that have
diacritics. Fast program. Very compact dictionaries made from simple
word lists. Many different dictionaries may be used
simultaneously. Optional emacs interface (based on ispell.el).
- fsa_build - builds various
kinds of simple finite state automata. No regular expressions,
tables, etc. Every line of input is treated as one word. This
program requires the data to be lexicographically sorted and without
duplicates.
- fsa_ubuild - does the same
job as fsa_build, but more slowly and using more memory. However,
the data need not be sorted, and it may contain duplicates.
- fsa_guess - performs
morphological analysis of both known and unknown words. The analysis
for known words can be 100% correct, the unknown ones -
approximative, based on suffixes and prefixes. The
dictionary is built from a morphological dictionary for a given
language, so once such a dictionary is available, no special
linguistic knowledge is required. There is also a mode for guessing
morphological descriptions in mmorph (a morphology tool from ISSCO) format. A Tcl/Tk
interface is provided that facilitates the task of adding new words
to a dictionary. You must have a core description first, and then
you can use the interface and fsa_guess to enlarge it.
- fsa_hash - implements perfect
hashing. Words in a sorted list of
words are numbered, and the program translates words to numbers, and
numbers to words. This can be used e.g. in indexing encyclopaedias.
- fsa_morph - performs
morphological analysis of words, i.e. for a
given inflected form, it gives the corresponding lexeme and
categories. Several dictionaries may be used at the same
time. Dictionaries are compact, data for them has a very simple format
(sample awk preparation scripts are present in the package), and the
program is very fast. Emacs interface for text annotation provided in
the package.
- fsa_prefix - prints all the
entries of a dictionary that begin with a given prefix.
- fsa_spell - spellchecks
words. Special care is taken to treat
sequences of characters that are pronounced the same as if they were
one character. This is important when looking for candidates for
replacements in Polish. Otherwise a user might think the word is not
in the dictionary, but it is correct.
- fsa_synth - generates surface
word forms given the canonical form and
features/tags/annotations. The latter can be specified using regular
expressions. Also, printing all surface forms for the given
canonical form is possible.
- fsa_visual - produces data
for visualization of automata with vcg
- a graph visualization tool. Useful for diagnostic purposes.
- There are also various scripts, mainly for converting data to
appropriate formats, but also for other purposes, like a Tcl/Tk
script that creates an interface for adding new words to a
morphological dictionary.
utr - transducers
Current version number: 0.10
- tr_accent - restores diacritics. Useful for languages that have
diacritics. Fast program, uses the same dictionaries as tr_spell and
tr_morph. Several dictionaries may be used at the same time. Emacs
interface provided (based on ispell.el).
- tr_morph - performs both morphological analysis and generation of
inflected forms. Fast. Several dictionaries may be used at the same
time. Dictionaries are the same as for tr_spell and tr_accent. Emacs
interface facilitates annotation of corpora.
- tr_prefix - lists all dictionary entries that have surface forms
beginning with a given string.
- tr_spell - spellchecks words. Fast. Uses the same dictionaries as
tr_accent and tr_morph. Several dictionaries may be used at the same
time. Special care is taken to treat sequences of characters that are
pronounced the same as if they were one character. This is important
when looking for candidates for replacements in Polish. Otherwise a
user might think the word is not the dictionary, but it is correct.
- tr_ubuild - builds transducers for tr_accent, tr_morph, tr_prefix,
and tr_spell. Transducers provide compact representation of
morphological data.
Dictionaries
Name extension conventions
To avoid confusion, I use some name extensions to indicate the contensts
of a dictionary:
- fsa
- a list of words compiled to a simple automaton - to be used
with fsa_accent, fsa_spell.
- fsm
- morphology in a form of a simple automaton - to be used
with fsa_morph.
- atg
- a list a tergo (i.e. inverted) of inflected forms with
their categories - to be used with fsa_guess compiled without
GUESS_LEXEMES or run with -a option. This is used to guess categories
of words from their endings.
- atl
- a list a tergo (i.e. inverted) of inflected forms with
lexemes and categories compiled to a simple automaton - to be used
with fsa_guess compiled with GUESS_LEXEMES, but without GUESS_PREFIX,
or with GUESS_PREFIX, but run with -p option. This is used to guess
corresponding lexemes and categories from word endings.
- atp
- a list a tergo (i.e. inverted) of inflected forms with
lexemes and categories compiled to a simple automaton - to be used
with fsa_guess compiled with both GUESS_LEXEMES and GUESS_PREFIX. In
this automaton prefixes are stored differently so that the automaton
is smaller, and some more generalizations are possible. This is used
to guess corresponding lexemes and categories from word endings and
beginnings.
- tr
- transducer - to be used with all programs that have names
beginning with tr_.
Available dictionaries
Note that the dictionaries were constructed long time ago, and with
fsa_build that used only a format available at that time. To use the
default set of compile options for the current version, you have to
rebuild the automata.
-
deutsch1.fsa.gz
- German word list from
ftp.informatik.tu-muenchen.de:/pub/doc/dict/. 7 bit only, umlauts
coded with following e, sharp s with ss. It is difficult
to convert them to 8 bit, as not every oe is o umlaut,
not every ss is sharp s, etc.
-
english.fsa.gz
- English word list from /usr/dict/words
-
francais.atl.gz
- French dictionary a tergo with categories and lexemes
from ISSCO
-
francais.fsa.gz
- French word list form
ISSCO
-
francais.fsm.gz
- French morphology (simple automaton) from
ISSCO
-
francais.tr.gz
- French transducer from ISSCO
-
french_moby.fsa.gz
- French word list from the
Moby Project
-
polski.fsa.gz
- Polish word list extracted from Dziennik Baltycki
articles
Some Other Finite-State Software I Wrote
Software Directly Related to Mine
More On-line Information on Finite State Automata
- Finite-state
algorithms - a list of algorithms related to finite-state
automata and directed acyclic graphs.
- Search The Computation and
Language E-Print Archive for articles
by Kemal Oflazer. I use a modified version of his spelling correction
algorithms.
- Do the same for Mehryar
Mohri. I do not use his algorithms, but they are interesting anyway.
You can find his articles and some software
at AT&T.
- You can find the most sophisticated transducers at Xerox. Actually, I was inspired by a
talk given by Lauri Karttunnen in Archamps. They have
a collection of
articles, and a page
dedicated to finite state automata with a cool demo.
- There is a
very impressive page by Gertjan van Noord describing a very
impressive set of FSA utilities (with visualization tools). I do not
use them, but they look far more sophisticated than mine. However,
the functionality is different. Several articles on FSAs can be
found there.
- A
collection of articles by Bruce W. Watson describing various automaton
minimization techniques and the FIRE Engine toolkit that implements
them (also available by ftp). Again, I do not use these algorithms; my
automata are minimized on-the-fly as they are created.
Bruce W. Watson is now at the Eindhoven Technical University and at
the University of Pretoria, South Africa. He used to have a web
server with his publications and his software, but that address has
been taken over by someone else.
- Tomasz
Kowaltowski and his team (Cláudio L. Lucchesi and Jorge Stolfi)
produce the smallest automata (measured in bytes). The small size
is obtained by special coding on the bit level. From version 0.17 of
the fsa package on, I use their compression of chains of states.
- Automata and transducers even more impressive (to me) than those by
Xerox are used in the
INTEX project. The
INTEX team also uses stack automata (capable of dealing with
context-free grammars). I have seen the system in action, and I was
amazed by the ease of use with which the user can enter new
information into the system, and the variety of output it can
produce. There is also a clone of INTEX - Unitex. It works in
Unicode, and unlike INTEX, it is available under GPL.
- You may want to read an article I wrote with Bruce Watson and Richard
Watson on the construction of minimal deterministic acyclic
finite-state automata. I use the algorithms in my programs.
- Stoyan Mihov independently invented my algorithm for sorted
data. In his article
he gives a proof of a very interesting feature of that algorithm
concerning its memory requirement. We prepared a joint publication in Computational Linguistics 26(1), 2000,
with Bruce and Richard and Stoyan (much revised version of the
Ankara paper with many corrections, and with Stoyan's
contribution).
- In general, you can look at my list of
publications for more papers on finite-state software I wrote.
- Vincent Le Maout wrote The Automaton
Standard Template Library for C++.
- Arnaud Adant extendend it with WFST.
- Finally, you may take a look at my
Ph.D.
dissertation. Here is the postscript
version.
- The Grail+
is a symbolic computation environment for finite-state machines,
regular expressions, and other formal language theory
objects. It has a page with links to similar software.
The bibliography from my Ph.D. thesis lists more
sources, but not all of them are available on-line. In the
HTML version I provided links to those
of them that are. A good source of
articles from this field is the Computational Linguistics, a journal
distributed to the members of the ACL.
Money
This software is free.
This software is provided "as is". If you have lost a million dollars by
using it, that was your million, not mine. I should not be liable for
any losses.
Starting from version 0.42, the fsa package can be distributed with
a GPL licence
except for one third party file that has its own copyright and
licence. Since I'm not sure how to understand the legalese gibberish
with regard to third party software, GPL is not included in the package.
Feedback
If you find bugs in those programs, please tell me about them. If you do
not, you will have to correct them yourself in all future versions!
Jan Daciuk
This page contains no Javascript. No need to analyze the source!