Consider an alphabet of two letters A={0,1}, and then the set A* of all finite words made up of 0’s and 1’s. We call any subset of this infinite set of words a language, L. The Chompsky Heirarchy structures the space of languages according to the types of “machines” that can recognize the languages. Here, a “machine” means a fancy decorated graph with specified start and end states and directed edges labeled with letters from the alphabet; sometimes we call these machines automata.

The way a machine reads a word is by following the letters of the word as they appear from left to right – starting at a start state – along the arrows of the graph. We say a machine accepts the word if there is such a path in the graph that ends at an end state; otherwise, it rejects it.

A machine recognizes a language if it accepts all of the words in the language and rejects all of the words not in the language. One example of a set of languages with a particular similarity is the set of regular languages – they are exactly those recognized by finite state machines!

A property one might want in a machine is that all of the edges coming out of a state have distinct letters labeling them. Such a machine is called deterministic. For any regular language, there is a unique minimal state deterministic finite automata – and having such a canonical representative of a language is very useful! Yet requiring a machine to be deterministic may require it to have exponentially more states than a non-deterministic representation. Finding canonical Non-deterministic Finite Automata is an active area of research (see one paper I’ve been interested in on Canonical Nondeterministic Automata).

One particular canonical NFA for a given language is the A`tomaton, which can be found by taking the reverse of the minimal DFA of the reverse language. This simple algorithm is due to Brzozowski and Tamm. The utility of this NFA comes from the states being unions of atoms. We posit that such an algorithm, when applied to the case of epsilon machines – an analogous graphical representation of stationary stochastic processes – can result in a worthwhile canonical representation.

Collaborators: Ryan James