Po polsku
This page contains no Javascript. No need to analyze the source!

Important note

Due to problems with GUT WWW servers, and due to GUT's planned switch to another internet domain, this page has been moved to a new location: http://www.jandaciuk.pl/fsa.html. Note that the software repository on an FTP server had to be moved as well, as not only it will be affected by the domain switch (the current domain will be abandoned!), but GUT's FTP server crew is unreliable (they switched off the server for several month at the beginning of 2016).

Finite state utilities

This page describes two software packages and some accompanying files available from http://www.jandaciuk.pl/Software/Fsa/ and http://www.jandaciuk.pl/Software/Utr/. Farther on this page, you will find direct links to the current versions of both software packages.

Further down on this page:

What are they for? (Purpose of the packages)
Available Packages
- fsa - finite-state automata
- utr - transducers
Dictionaries
- Name extension conventions
- Available Dictionaries
Some Other Software I Wrote
Software Directly Related to Mine
More On-line Information on Finite State Automata
Money
Feedback

What are they for?

Spellchecking
Diacritic restoration
Morphological Analysis and Synthesis
Acquisition of data for morphological descriptions
Perfect Hashing

Available Packages

Both packages are written in C++, and they can be compiled with g++. Problems may arise when compiling with different compilers due to the use of templates that tend not to be implemented universally. Both packages use compact dictionaries that are automata (of different forms) An interface in elisp is provided for both of them. The interface worked with emacs19, but it is highly unlikely that it will work with emacs20 or later (specially with mule).

fsa - finite state automata

Current version number: 0.51. Man pages (except for fsa_synth) in HTML format are also available as one file.

fsa_accent - restores diacritics. Useful for languages that have diacritics. Fast program. Very compact dictionaries made from simple word lists. Many different dictionaries may be used simultaneously. Optional emacs interface (based on ispell.el).
fsa_build - builds various kinds of simple finite state automata. No regular expressions, tables, etc. Every line of input is treated as one word. This program requires the data to be lexicographically sorted and without duplicates.
fsa_ubuild - does the same job as fsa_build, but more slowly and using more memory. However, the data need not be sorted, and it may contain duplicates.
fsa_guess - performs morphological analysis of both known and unknown words. The analysis for known words can be 100% correct, the unknown ones - approximative, based on suffixes and prefixes. The dictionary is built from a morphological dictionary for a given language, so once such a dictionary is available, no special linguistic knowledge is required. There is also a mode for guessing morphological descriptions in mmorph (a morphology tool from ISSCO) format. A Tcl/Tk interface is provided that facilitates the task of adding new words to a dictionary. You must have a core description first, and then you can use the interface and fsa_guess to enlarge it.
fsa_hash - implements perfect hashing. Words in a sorted list of words are numbered, and the program translates words to numbers, and numbers to words. This can be used e.g. in indexing encyclopaedias.
fsa_morph - performs morphological analysis of words, i.e. for a given inflected form, it gives the corresponding lexeme and categories. Several dictionaries may be used at the same time. Dictionaries are compact, data for them has a very simple format (sample awk preparation scripts are present in the package), and the program is very fast. Emacs interface for text annotation provided in the package.
fsa_prefix - prints all the entries of a dictionary that begin with a given prefix.
fsa_spell - spellchecks words. Special care is taken to treat sequences of characters that are pronounced the same as if they were one character. This is important when looking for candidates for replacements in Polish. Otherwise a user might think the word is not in the dictionary, but it is correct.
fsa_synth - generates surface word forms given the canonical form and features/tags/annotations. The latter can be specified using regular expressions. Also, printing all surface forms for the given canonical form is possible.
fsa_visual - produces data for visualization of automata with vcg - a graph visualization tool. Useful for diagnostic purposes.
There are also various scripts, mainly for converting data to appropriate formats, but also for other purposes, like a Tcl/Tk script that creates an interface for adding new words to a morphological dictionary.

utr - transducers

Current version number: 0.10

tr_accent - restores diacritics. Useful for languages that have diacritics. Fast program, uses the same dictionaries as tr_spell and tr_morph. Several dictionaries may be used at the same time. Emacs interface provided (based on ispell.el).
tr_morph - performs both morphological analysis and generation of inflected forms. Fast. Several dictionaries may be used at the same time. Dictionaries are the same as for tr_spell and tr_accent. Emacs interface facilitates annotation of corpora.
tr_prefix - lists all dictionary entries that have surface forms beginning with a given string.
tr_spell - spellchecks words. Fast. Uses the same dictionaries as tr_accent and tr_morph. Several dictionaries may be used at the same time. Special care is taken to treat sequences of characters that are pronounced the same as if they were one character. This is important when looking for candidates for replacements in Polish. Otherwise a user might think the word is not the dictionary, but it is correct.
tr_ubuild - builds transducers for tr_accent, tr_morph, tr_prefix, and tr_spell. Transducers provide compact representation of morphological data.

Dictionaries

Name extension conventions

To avoid confusion, I use some name extensions to indicate the contensts of a dictionary:

fsa: a list of words compiled to a simple automaton - to be used with fsa_accent, fsa_spell.
fsm: morphology in a form of a simple automaton - to be used with fsa_morph.
atg: a list a tergo (i.e. inverted) of inflected forms with their categories - to be used with fsa_guess compiled without GUESS_LEXEMES or run with -a option. This is used to guess categories of words from their endings.
atl: a list a tergo (i.e. inverted) of inflected forms with lexemes and categories compiled to a simple automaton - to be used with fsa_guess compiled with GUESS_LEXEMES, but without GUESS_PREFIX, or with GUESS_PREFIX, but run with -p option. This is used to guess corresponding lexemes and categories from word endings.
atp: a list a tergo (i.e. inverted) of inflected forms with lexemes and categories compiled to a simple automaton - to be used with fsa_guess compiled with both GUESS_LEXEMES and GUESS_PREFIX. In this automaton prefixes are stored differently so that the automaton is smaller, and some more generalizations are possible. This is used to guess corresponding lexemes and categories from word endings and beginnings.
tr: transducer - to be used with all programs that have names beginning with tr_.

Available dictionaries

Note that the dictionaries were constructed long time ago, and with fsa_build that used only a format available at that time. To use the default set of compile options for the current version, you have to rebuild the automata.

deutsch1.fsa.gz: German word list from ftp.informatik.tu-muenchen.de:/pub/doc/dict/. 7 bit only, umlauts coded with following e, sharp s with ss. It is difficult to convert them to 8 bit, as not every oe is o umlaut, not every ss is sharp s, etc.
english.fsa.gz: English word list from /usr/dict/words
francais.atl.gz: French dictionary a tergo with categories and lexemes from ISSCO
francais.fsa.gz: French word list form ISSCO
francais.fsm.gz: French morphology (simple automaton) from ISSCO
francais.tr.gz: French transducer from ISSCO
french_moby.fsa.gz: French word list from the Moby Project
polski.fsa.gz: Polish word list extracted from Dziennik Baltycki articles

Some Other Finite-State Software I Wrote

finite-state library for NLP, based on a subset of the fsa package, but including compressed language models. Requires the fsa package.
incremental minimization of finite-state automata, and other minimization algorithms for comparison.
various algorithms for construction of acyclic finite-state automata.

Software Directly Related to Mine

The fsa package and related fadd library are used in Alpino - a wide-coverage dependency parser for Dutch.
Dawid Weiss translated most parts of the fsa package (except automata construction) to Java. His collection of Java classes is used in his Polish stemmer project.
Guillaume Rousse has written a perl wrapper around my fsa package.
Marcin Miłkowski maintains an on-line spelling dictionary that uses fsa_spell from my fsa package.

More On-line Information on Finite State Automata

Finite-state algorithms - a list of algorithms related to finite-state automata and directed acyclic graphs.
Search The Computation and Language E-Print Archive for articles by Kemal Oflazer. I use a modified version of his spelling correction algorithms.
Do the same for Mehryar Mohri. I do not use his algorithms, but they are interesting anyway. You can find his articles and some software at AT&T.
You can find the most sophisticated transducers at Xerox. Actually, I was inspired by a talk given by Lauri Karttunnen in Archamps. They have a collection of articles, and a page dedicated to finite state automata with a cool demo.
There is a very impressive page by Gertjan van Noord describing a very impressive set of FSA utilities (with visualization tools). I do not use them, but they look far more sophisticated than mine. However, the functionality is different. Several articles on FSAs can be found there.
A collection of articles by Bruce W. Watson describing various automaton minimization techniques and the FIRE Engine toolkit that implements them (also available by ftp). Again, I do not use these algorithms; my automata are minimized on-the-fly as they are created.
Bruce W. Watson is now at the Eindhoven Technical University and at the University of Pretoria, South Africa. He used to have a web server with his publications and his software, but that address has been taken over by someone else.
Tomasz Kowaltowski and his team (Cláudio L. Lucchesi and Jorge Stolfi) produce the smallest automata (measured in bytes). The small size is obtained by special coding on the bit level. From version 0.17 of the fsa package on, I use their compression of chains of states.
Automata and transducers even more impressive (to me) than those by Xerox are used in the INTEX project. The INTEX team also uses stack automata (capable of dealing with context-free grammars). I have seen the system in action, and I was amazed by the ease of use with which the user can enter new information into the system, and the variety of output it can produce. There is also a clone of INTEX - Unitex. It works in Unicode, and unlike INTEX, it is available under GPL.
You may want to read an article I wrote with Bruce Watson and Richard Watson on the construction of minimal deterministic acyclic finite-state automata. I use the algorithms in my programs.
Stoyan Mihov independently invented my algorithm for sorted data. In his article he gives a proof of a very interesting feature of that algorithm concerning its memory requirement. We prepared a joint publication in Computational Linguistics 26(1), 2000, with Bruce and Richard and Stoyan (much revised version of the Ankara paper with many corrections, and with Stoyan's contribution).
In general, you can look at my list of publications for more papers on finite-state software I wrote.
Vincent Le Maout wrote The Automaton Standard Template Library for C++.
Arnaud Adant extendend it with WFST.
Finally, you may take a look at my Ph.D. dissertation. Here is the postscript version.
The Grail+ is a symbolic computation environment for finite-state machines, regular expressions, and other formal language theory objects. It has a page with links to similar software.

The bibliography from my Ph.D. thesis lists more sources, but not all of them are available on-line. In the HTML version I provided links to those of them that are. A good source of articles from this field is the Computational Linguistics, a journal distributed to the members of the ACL.

Money

This software is free.

This software is provided "as is". If you have lost a million dollars by using it, that was your million, not mine. I should not be liable for any losses.

Starting from version 0.42, the fsa package can be distributed with a GPL licence except for one third party file that has its own copyright and licence. Since I'm not sure how to understand the legalese gibberish with regard to third party software, GPL is not included in the package.

Feedback

If you find bugs in those programs, please tell me about them. If you do not, you will have to correct them yourself in all future versions!

Jan Daciuk
This page contains no Javascript. No need to analyze the source!