A finite-state automaton is a device that can be in one of a finite
number of states . In certain conditions, it can switch to
another state. This is called a transition .
When the automaton starts working (when it is switched on), it can be
in one of its initial states. There is also another important subset
of states of the automaton: the final states. If the automaton is in a
final state when it stops working, it is said to accept its
input. The input is a sequence of symbols. The interpretation of the
symbols depends on the application; they usually represent events, or
can be interpreted as ``the event that a particular data became
available''. The symbols must come from a finite set of symbols,
called the alphabet . If a particular symbol in a
particular state triggers a transition from that state to another one,
that transition is labeled with that symbol. The labels of
transitions can contain one particular symbol that is not in the
alphabet. A transition is labeled with (not present in the
alphabet) if it can be traversed with no input symbol.
It is convenient to present automata as directed graphs. The vertices denote states. They are portrayed as small circles. The transitions form the edges - arcs with arrows pointing from the source state (the state where the transition originates) to the target state. They are labeled with symbols. Unless it is clear from the context, the initial states have short arrows that point to them from ``nowhere''. The final states are represented as two concentric circles.
Figure 2.1: Sample finite-state automaton
The example automaton (fig. 2.1) contains 6 states labeled A, B, C, D, E and F. The state labeled A is initial. The states A, C, and D are final. The automaton contains no cycles, as it is impossible to leave a state, follow some transitions, and return in this way to the same original state.
Starting in one of the initial states, the automaton processes a
sequence of input symbols. In a given state, it checks if the next
input symbol matches any of the labels of the transitions that go out
of the state. If so, the automaton follows it to the target state of
the transition. If is among the labels, the transition it
labels is followed without reading the input. There may be more than
one transition with the same label, or a transition with a matching
label and a transition labeled with
. In that case, the
decision which path to choose is arbitrary. If there is no transition
that matches the input symbol (in particular, if there is no
-transition), the automaton stops and rejects the input. The
input is accepted when all input is read and matched by transitions
and the automaton is in a final state.
A set of sequences of symbols gathered by pursuing all paths from
initial states to the final ones is called the
language
accepted by the
automaton. E.g. the example automaton (fig. 2.1)
accepts the language . The symbol
, meaning an empty input, is accepted because the initial
state A is also the final one. a is accepted, because
there is a transition from A to C labeled with a,
and C is a final state. The input bc is accepted,
because the control in the automaton can go from A to F on
a, and then it can follow the
-labeled transition to
B, and then a transition labeled c to D, which is a
final state. The sequences: {ad, ae, af, ba} are all rejected,
because E is not a final state. The language of the automaton
in fig. 2.1 is finite, because the automaton
contains no cycles.
An automaton that has no -labeled transitions is called
-free. An automaton that has one initial state, is
-free, and for each state there is at most one transition
labeled with the same symbol is called a deterministic finite-state
automaton .
The automaton from fig. 2.1 can be transformed into
the deterministic automaton from fig. 2.2.
Figure 2.2: Deterministic version of the sample automaton
A state is said to be reachable from another state iff there is a sequence of transitions leading to the state from that other state. An automaton is said to be start-useful iff every state is reachable from any of its initial states. An automaton is useful iff it is start-useful and for every state there is at least one final state reachable from that state. The automaton from the fig. 2.2 is not useful; its useful version is presented in fig. 2.3. That version accepts the same language.
Figure 2.3: Useful version of the sample deterministic automaton
An automaton is said to be acyclic when it contains no cycles, i.e. it is not possible to arrive at the same state twice when following the transitions. All automata in figures presented in this dissertation are acyclic.
Formally, an automaton is a 5-tuple , where Q is a set of states,
is a set of initial (or starting)
states,
is a set of final states,
is the
alphabet , and
is a partial mapping
denoting transitions. We extend the
mapping to
. The size of the
automaton, |M|, is equal to the number of states |Q|. We define
the language
accepted by the automaton M to be:
Let the mapping be the right
language of a state
q in M (the set of all strings over
that will take one
from state q to any final state of M using the extended transition
relation
):
The consequence of the Myhill-Nerode theorem ([HU79]) is that among many automata that accept a given language there is one (up to isomorphism) that has a minimal number of states. It is called a minimal automaton . The automaton in fig. 2.3 is minimal.
The property of minimality of is defined as follows ([Wat93a], [Wat93b], [Wat95]):
We will use, however, an alternative definition of
minimality of
([Wat93a],
[Wat93b]):
is a
property specifying that all states can be reached from the start
state, and it is defined as follows:
Of particular interest for the NLP community are deterministic acyclic
finite state automata (DAFSA)
, because they are best suited for
representing dictionaries (of various sorts). Whenever we refer to an
automaton (or FSA) throughout this paper, we mean an acyclic, deterministic
finite state automaton (DAFSA), unless explicitly stated otherwise. Several
definitions given above can be rewritten in a simpler way: a
deterministic, finite-state automaton is a 5-tuple
, where Q is a set of states,
is the starting state,
is a set of final states,
is the alphabet , and
is a partial
mapping
denoting
transitions. We extend the
mapping to
as in [HU79]. The
definitions of the language accepted by an automaton, the right
language of a state, and of the
property for DAFSA
become simpler:
It is possible to use alternative definition of an FSA: , where the meaning of Q,
, i, and F is
the same as above, and E is a relation
-
an alternative expression of the
function from above. This
alternative definition may be more useful in descriptions below.
The algorithms described in this paper operate on classical
DAFSA (deterministic acyclic finite-state automata). However, our
implementation uses a different kind of automata:
finite-state automata with final transitions (FSA-FT)
.
We define such automaton as , where Q is the
finite set of states,
is the alphabet ,
is the set of the initial states of the automaton,
is a set of transitions, and
is the set of final transitions. We say that the FSA-FT accepts a word
if for each symbol from the input a transition was found, and the last
transition it traversed while recognizing the word was final.
We define a transformation of a classical finite-state
automaton into a finite-state automaton with final transitions: for every
FSA we find an FSA-FT
,
where
, i' is the new initial state,
, and
.
Let to be the transitive closure of F defined by the
following recursive relation:
Now we can define the language of the FSA-FT to be:
For every deterministic acyclic finite-state automaton
that does not accept empty input (written as
),
i.e.
, or
, we can find a deterministic acyclic finite-state automaton with
final transitions (DAFSA-FT)
,
, where
. We will call this transformation moving
markers . It preserves the number of states and
the number of transitions. Because such a transformation exists, and
it is defined for the minimal
automata , we can write that
.
Minimal deterministic acyclic finite-state automata with final transitions can indeed be smaller than their classical counterparts (see fig. 2.4). The definition of minimality for an FSA-FT is the same as for the classical FSA, but the definition of the right language of a state in FSA-FT must change:
Figure 2.4: Classical minimal deterministic acyclic FSA and the
corresponding FSA with final transitions
Note that for DAFSA-FT . This is different from the
classical automata, where
. That difference is very
important. It means that the recognition of a word is moved from a
final state in the classical automaton to the transition reaching
that state in the FSA-FT, resulting in the removal of the empty
string (
) from the right language of that automaton. Let
be a DAFSA (a classical one), and
be a DAFSA-FT obtained by the transformation
moving markers . We can write:
If we can find two states in the classical FSA that have the same right languages except for the empty string, we can merge the states together in a deterministic acyclic finite-state automaton with final transitions. This makes the resulting automaton smaller, i.e. it has less states and less transitions than the corresponding classical finite-state automaton.
Note that as most implementations of automata have implicit states (only transitions are represented explicitly), this new kind of automata saves space at no cost.
The automata with final transitions can be seen as Mealy's automata where the property of being final is the output label. Therefore, they can be perceived more as a notational convention than a new kind of automata. However in transducers, that property would have to be concatenated with the output labels.