Index: lispref/text.texi
===================================================================
RCS file: /cvsroot/emacs/emacs/lispref/text.texi,v
retrieving revision 1.57
diff -c -r1.57 text.texi
*** lispref/text.texi	13 Sep 2002 19:36:55 -0000	1.57
--- lispref/text.texi	7 Nov 2002 17:34:11 -0000
***************
*** 59,64 ****
--- 59,65 ----
  * Base 64::          Conversion to or from base 64 encoding.
  * MD5 Checksum::     Compute the MD5 ``message digest''/``checksum''.
  * Change Hooks::     Supplying functions to be run when text is changed.
+ * State Machines::   Writing reader functions.
  @end menu
  
  @node Near Point
***************
*** 3779,3781 ****
--- 3780,4267 ----
  
  This variable is available starting in Emacs 21.
  @end defvar
+ 
+ @node State Machines
+ @section Writing Simple Reader Functions
+ @cindex state machines
+ 
+ A state machine is a model for describing complex control structures.
+ At any time a state machine is in exactly one state: it receives some
+ input, possibly generates some output, and then switches to another
+ state, where the cycle repeats.
+ 
+ State machines are useful in a wide range of computing tasks; in
+ fact, a computer itself is just a very complex state machine.  One
+ area where they are used frequently is reading and parsing textual
+ input.  Emacs provides the macro @code{run-state-machine} to
+ construct state machines for the purpose of reading text from a
+ buffer.
+ 
+ @menu
+ * State Machine Basics::     A short introduction to state machines.
+ * Example State Machine::    A basic example.
+ * Defining State Machines::  Defining state machines for reader 
+                              functions.
+ * Character Based Reading::  Switching states based on the character
+                              after point.
+ * Regexp Based Reading::     Switching states based on the buffer text
+                              after point.
+ * State Machine Notes::      Caveats and notes on performance.
+ @end menu
+ 
+ @node State Machine Basics
+ @subsection State Machines -- an Introduction
+ 
+ A traffic light may serve as a very basic example of a state
+ machine.  It has four states: a red state, a green state and two yellow
+ states.  There is one single input event, the signal ``Change the
+ colour now!''.  As ``output'' it turns certain electric bulbs on or
+ off.
+ 
+ @example
+ @group
+              +-------+
+     ,--------|  red  |<-------.
+     |        +-------+        |
+     |                         |
+     |                         |
+     v                         |
+ +-------+                 +-------+
+ |yellow1|                 |yellow2|
+ +-------+                 +-------+
+     |                         ^
+     |                         |
+     |        +-------+        |
+     `------->| green |--------´
+              +-------+
+ @end group
+ @end example
+ 
+ When a state machine receives input, it decides which state should be
+ the next state.  In a traffic light this decision is very simple,
+ because in each state there is only one type of input event and only
+ one possible transition to another state: when the traffic light is in
+ the state ``red'' and it receives valid input, it switches to the
+ state ``yellow1'' unconditionally.  A more complex state machine would
+ allow more types of input events and would base the decision about
+ the next state not only on the current state, but also on the
+ @emph{specific} input event.  A state machine that is made to read
+ textual input would typically choose the next state according to the
+ actual character it received.
+ 
+ As described above, the traffic light is assumed to have been running
+ always and to keep running forever.  This is, of course, not true of
+ a reader.  Most state machines define one state as the starting
+ state and one or more states as final states, so that the state
+ machine terminates after switching to a state declared to be a final
+ one.
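+ 
+ The traffic light can be sketched in a few lines of plain Emacs Lisp,
+ without @code{run-state-machine}.  This is only an illustration of
+ the idea of states and transitions; the names
+ @code{my-traffic-light-transitions} and @code{my-traffic-light-next}
+ are made up for this example.
+ 
+ @lisp
+ @group
+ ;; @r{Each element maps a state to the state that follows it.}
+ (defvar my-traffic-light-transitions
+   '((red     . yellow1)
+     (yellow1 . green)
+     (green   . yellow2)
+     (yellow2 . red)))
+ @end group
+ 
+ @group
+ (defun my-traffic-light-next (state)
+   "Return the state that follows STATE."
+   (cdr (assq state my-traffic-light-transitions)))
+ @end group
+ 
+ @group
+ (my-traffic-light-next 'red)
+      @result{} yellow1
+ @end group
+ @end lisp
+ 
+ Because there is only one kind of input event, the table needs no
+ column for the input; a reader would additionally have to look at the
+ character it received.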
+ 
+ @node Example State Machine
+ @subsection A Basic Example of a State Machine
+ 
+ The macro @code{run-state-machine} provides an easy way to define and
+ run state machines in Emacs Lisp for the purpose of reading input from
+ a buffer.  Here is a basic usage example as an introduction.
+ 
+ Suppose we need a function that parses the content of a buffer for
+ strings.  Each time we call this function, it should scan the buffer
+ for a quotation mark and return the buffer content up to the next
+ quotation mark, thereby moving point forward.  We could use this
+ function to get every string from the buffer if we call it often
+ enough.  Here is how it could be done with @code{run-state-machine}:
+ 
+ @lisp
+ @group
+ (defun read-next-string ()
+   (run-state-machine ()
+     (look-for-string (?\" nil t read-string)
+ 		     (t   nil t look-for-string))
+     (read-string (?\" nil t exit)
+ 		 (t   t   t read-string))))
+ @end group
+ @end lisp
+ 
+ This state machine has only two states, @code{look-for-string} and
+ @code{read-string}.  In each state the machine looks at the character
+ after point and decides whether it should stay in the same state or
+ switch to another one.  In both cases, however, it moves point forward
+ by one character.  The possible transitions to other states are
+ defined as lists after the name of the state.  Here each state has two
+ of them.
+ 
+ The starting state is @code{look-for-string}, because it is the first
+ one listed.  When the machine is in the state @code{look-for-string},
+ it checks whether the character after point is a @samp{"}.  This is
+ indicated by the @code{?\"} at the beginning of the first transition.
+ If it is, it switches to the state @code{read-string}.  If the
+ character after point is anything else but a @samp{"} (indicated by
+ the symbol @code{t} at the beginning of the second transition), the
+ machine stays in the state @code{look-for-string}.
+ 
+ When the state machine is in the state @code{read-string}, it checks
+ again whether the character after point is a @samp{"}.  But this time
+ it terminates if it is.  This is specified by the special state name
+ @code{exit}.  In all other cases the string it is currently reading is
+ not finished, so the machine stays in the state @code{read-string}.
+ 
+ On each transition to another state the state machine may store the
+ character after point internally.  The concatenation of all stored
+ characters is the return value of @code{run-state-machine}.  The
+ second element in the definition of a transition specifies whether
+ the character after point should be appended to the return value in
+ this way.  In our example it is @code{t} only in the second transition of
+ @code{read-string}, because we want to return only characters between
+ two @samp{"}.  The third element indicates if the state machine should
+ move point forward in the buffer; here it is @code{t} in all
+ transitions.
+ 
+ This is a diagram of the states and transitions:
+ @example
+ @group
+                           Start
+                             |
+                             |
+                             |
+                             v
+                     +-------------------+
+                     |look-for-string    |<---.  Character is not a
+                     +-------------------+    |  quotation mark.
+                             |     |          |  * Do not store it.
+       Character is a        |     `----------´  * Move point forward.
+       quotation mark.       |
+       * Do not store it.    |
+       * Move point forward. |
+                             v
+                     +-------------------+
+                     |read-string        |<---.  Character is not a
+                     +-------------+-----+    |  quotation mark.
+                             |     |          |  * Store it.
+       Character is a        |     `----------´  * Move point forward.
+       quotation mark.       |
+       * Do not store it.    |
+       * Move point forward. |
+                             v
+                           Exit
+ @end group
+ @end example
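+ 
+ For comparison, here is a hand-written sketch of the same two states
+ in plain Emacs Lisp.  It is not the code that
+ @code{run-state-machine} expands to; the function name
+ @code{my-read-next-string-by-hand} is made up for this example, and
+ the handling of the end of the buffer follows the default described
+ for the macro (return @code{nil}).
+ 
+ @lisp
+ @group
+ (defun my-read-next-string-by-hand ()
+   "Return the text between the next two quotation marks, or nil."
+   (let ((state 'look-for-string)
+         (result nil))
+     (while (and (not (eq state 'exit)) (not (eobp)))
+       (let ((char (char-after)))
+         (cond
+          ((eq state 'look-for-string)
+           ;; @r{Switch states on a quotation mark; store nothing.}
+           (if (eq char ?\")
+               (setq state 'read-string)))
+          ((eq state 'read-string)
+           (if (eq char ?\")
+               (setq state 'exit)
+             ;; @r{Store every other character.}
+             (setq result (concat result (char-to-string char)))))))
+       ;; @r{Every transition moves point forward.}
+       (forward-char 1))
+     ;; @r{Reaching the end of the buffer before `exit' yields nil.}
+     (if (eq state 'exit) result nil)))
+ @end group
+ @end lisp
+ 
+ The explicit loop makes the bookkeeping visible: the current state,
+ the accumulated result, and the movement of point are exactly the
+ three things that each transition of the state machine describes.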
+ 
+ @node Defining State Machines
+ @subsection Defining State Machines
+ @cindex writing a reader
+ 
+ The macro @code{run-state-machine} provides a mini-language to define
+ state machines for reader functions.  To define a state machine, you
+ list its states and specify what the machine should do in each state
+ by defining the transitions from that state to the following ones.
+ The two kinds of transitions are explained in more detail in
+ @ref{Character Based Reading} and @ref{Regexp Based Reading}.
+ 
+ @deffn Macro run-state-machine spec &rest states
+ This macro defines and executes a state machine.
+ 
+ On each transition from one state to the next, the state machine may
+ accumulate characters or strings from the current buffer.  When the
+ machine terminates, it returns these by default as a single
+ concatenated string; it returns @code{nil} if no characters or
+ strings were accumulated.
+ 
+ The first argument @var{spec}, if non-@code{nil}, is a list of the
+ form
+ 
+ @example
+ (@var{result-variable} @var{end-of-buffer-expression})
+ @end example
+ 
+ @var{result-variable} should be a symbol or @code{nil}.  The symbol is
+ used as the name of the variable in which the state machine stores
+ its return value at run time.  Specify a symbol if you want to access
+ this variable.  If you are not interested in what variable name the
+ state machine uses internally, specify @code{nil}.  In this case the
+ state machine still returns the accumulated characters or strings, but
+ it handles them transparently.  Note that specifying a symbol is
+ unnecessary in many cases, because you can also control the return
+ value to a large extent by specifying a function as
+ @var{add-to-result} in the definition of a transition (see below).
+ 
+ When a state machine encounters the end of the buffer at run time, it
+ terminates and returns @code{nil} by default.  To change the return
+ value, you can specify a Lisp expression as
+ @var{end-of-buffer-expression} that is evaluated in this case; the
+ value of this expression becomes the return value of the state
+ machine.  To return the accumulated result in whatever state it may be
+ at that time, simply set @var{result-variable} to some symbol and
+ repeat that symbol as @var{end-of-buffer-expression}.  For example,
+ @code{(output output)} as @var{spec} leads to a state machine that
+ uses the variable @code{output} to store the result and returns the
+ value of @code{output} unmodified when it terminates at the end of the
+ buffer.
+ 
+ @var{spec} may be @code{nil}.  This is equivalent to @code{(nil nil)}.
+ @var{spec} is followed by one or more definitions of
+ states.
+ 
+ The definition of a state consists of the name of the state followed
+ by one or more definitions of transitions to following states.  The
+ starting state is always the first one listed.  Specifying the state
+ name @code{exit} as the following state in a transition means to
+ terminate processing; you can't define a state named @code{exit}.
+ 
+ A transition is a list of the form
+ 
+ @example
+ (@var{matcher} @var{add-to-result} @var{advance} @var{next-state})
+ @end example
+ 
+ @var{matcher} may be a regular expression (@pxref{Regexp Based Reading}),
+ a single character (@pxref{Character Type}), a cons cell of two
+ characters indicating a range of characters, a list of characters or
+ character ranges indicating character alternatives, or the symbol
+ @code{t} as a default.  @xref{Character Based Reading}.
+ 
+ If @var{add-to-result} is @code{t}, the character after point is
+ appended to the return value (which is normally a string).
+ If @var{advance} is @code{t}, the machine moves point forward by one
+ character.  Then it switches to the state specified by
+ @var{next-state}.  If @var{matcher} is a regular expression,
+ @var{add-to-result} and @var{advance} may be integers.  In this case
+ the integer specifies the subexpression of @var{matcher} which should
+ be added to the result or to whose end point should move,
+ respectively. @xref{Regexp Based Reading}.
+ 
+ @var{add-to-result} may be a function.  In this case the function
+ should take two arguments: the current return value and the pending
+ input character (as a string, not as a character).  The return value
+ of the state machine is then set to the value returned by this
+ function.  If @var{advance} is a function, it is called with no
+ arguments and takes full responsibility for moving point.  For
+ example, a transition
+ @code{(?a t t another-state)}
+ could be written equivalently (modulo performance) as
+ @code{(?a concat forward-char another-state)}.
+ @end deffn
+ 
+ @node Character Based Reading
+ @subsection Switching States Based on the Character after Point
+ 
+ In character-based reading, a state machine specified with
+ @code{run-state-machine} switches states based on the character after
+ point.
+ 
+ In the simple case @var{matcher} is either a character or the symbol
+ @code{t}.  To find the right transition, the state machine looks for a
+ transition whose @var{matcher} is @code{eq} to the character after
+ point.  If it finds none, it looks for a transition whose @var{matcher}
+ is @code{t}.
+ 
+ For example, a state could look like this:
+ 
+ @example
+ @group
+ ;;           MATCHER ADD-TO-RESULT ADVANCE NEXT-STATE
+ (look-for-a (?a      t             t       another-state-1)
+             (t       nil           nil     another-state-2))
+ @end group
+ @end example
+ 
+ When the state machine is in the state @code{look-for-a}, it checks if
+ the character after point is an @samp{a}.  If it is, it adds an
+ @samp{a} to the return value, moves point forward and switches to the
+ state @code{another-state-1}.  Otherwise, it does not add anything to the
+ return value, leaves point where it is and switches to the state
+ @code{another-state-2}.
+ 
+ You can specify alternative characters or a range of characters as
+ @var{matcher} in a transition.  To specify a range of characters, use
+ a cons cell of the first and the last character: @code{(?a . ?z)} as
+ @var{matcher} matches all characters from @samp{a} up to
+ @samp{z}.  To specify alternative matches, use a list of characters or
+ of character ranges.  For example @code{(?a ?b ?c)} as @var{matcher}
+ matches the characters @samp{a}, @samp{b} or @samp{c}, while
+ @code{((?0 . ?9) ?-)} matches digits and hyphens.
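+ 
+ The three forms of a character @var{matcher}, together with the
+ default @code{t}, can be illustrated by a small predicate written in
+ plain Emacs Lisp.  This is only a sketch of the matching rules
+ described above, not the macro's actual implementation; the name
+ @code{my-matcher-matches-p} is made up for this example, and
+ @var{char} is assumed to be a character, not @code{nil}.
+ 
+ @lisp
+ @group
+ (defun my-matcher-matches-p (matcher char)
+   "Return non-nil if MATCHER accepts the character CHAR."
+   (cond
+    ;; @r{The default matches anything.}
+    ((eq matcher t) t)
+    ;; @r{A single character.}
+    ((integerp matcher) (eq matcher char))
+    ;; @r{A cons cell of two characters is a range.}
+    ((and (consp matcher) (integerp (cdr matcher)))
+     (and (<= (car matcher) char) (<= char (cdr matcher))))
+    ;; @r{A list of characters or ranges: try each alternative.}
+    ((consp matcher)
+     (let ((found nil))
+       (while (and matcher (not found))
+         (setq found (my-matcher-matches-p (car matcher) char)
+               matcher (cdr matcher)))
+       found))))
+ @end group
+ 
+ @group
+ (my-matcher-matches-p '((?0 . ?9) ?-) ?7)
+      @result{} t
+ (my-matcher-matches-p '(?a ?b ?c) ?z)
+      @result{} nil
+ @end group
+ @end lisp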
+ 
+ As an extended example, here is the definition of a reader function
+ @code{example-reader} that returns everything inside @samp{"} as a
+ string, digits as integers and everything else as a symbol.  Each time
+ the function is called it reads the next object from the current
+ buffer and returns it according to its type.  If it encounters the end
+ of the buffer, it signals an error.
+ 
+ @smalllisp
+ @group
+ (defun exmpl-make-number (output ignore)
+   (string-to-number output))
+ @end group
+ 
+ @group
+ (defun exmpl-make-symbol (output ignore)
+   (make-symbol output))
+ @end group
+ 
+ @group
+ (defun example-reader ()
+   (run-state-machine (nil (error "End of buffer"))
+     ;; @r{Starting state.}
+     (start (?\" nil t read-string)
+            ((?0 . ?9) t t read-integer)
+            ((?\n ?\t ?\ ) nil t skip-white-space)
+            (t t t read-symbol))
+     ;; @r{Skip white-space}
+     (skip-white-space ((?\t ?\n ?\ ) nil t skip-white-space)
+                       (t nil nil start))
+     ;; @r{Read a string: add everything to the return value up to the}
+     ;; @r{next `"'.  Then exit.}
+     (read-string (?\" nil t exit)
+                  (t t t read-string))
+     ;; @r{Read an integer: add every digit to the return value.  Every}
+     ;; @r{other character causes the machine to exit.  Convert the}
+     ;; @r{return-value to an integer upon exit.}
+     (read-integer ((?0 . ?9) t t read-integer)
+                   (t exmpl-make-number nil exit))
+     ;; @r{Read a symbol: read everything which is not a digit, a `"' or a}
+     ;; @r{white-space character.  Return a symbol.}
+     (read-symbol ((?\" ?\  ?\t ?\n (?0 . ?9))
+                   exmpl-make-symbol nil exit)
+                  (t t t read-symbol))))
+ @end group
+ @end smalllisp
+ 
+ @node Regexp Based Reading
+ @subsection Switching States Based on the Buffer Text after Point
+ 
+ Instead of a character (or a list of character-alternatives) or the
+ symbol @code{t}, @var{matcher} may be a regular expression.
+ @xref{Regular Expressions}.  In this case the state machine checks for
+ a matching transition by applying the regexp with @code{looking-at} to
+ the text after point.
+ 
+ When @var{matcher} is a regexp, @var{add-to-result} and @var{advance}
+ may be integers.  An integer as @var{add-to-result} specifies the
+ subexpression whose match is added to the return value.  An integer
+ as @var{advance} specifies the subexpression to whose end point
+ should move.  In both cases @code{t} is interpreted as @code{0}.
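+ 
+ In terms of ordinary Emacs Lisp primitives, a transition with a
+ regexp as @var{matcher}, @code{1} as @var{add-to-result} and @code{t}
+ as @var{advance} corresponds roughly to the following steps.  This is
+ a sketch using only @code{looking-at}, @code{match-string} and
+ @code{match-end}; the buffer text is made up, and how the macro
+ performs these steps internally is not specified here.
+ 
+ @lisp
+ @group
+ (with-temp-buffer
+   (insert "Name Tars Tarkas # a comment\n")
+   (goto-char (point-min))
+   (let ((case-fold-search t))
+     (when (looking-at "\\s-*name\\s-+\\(.*?\\)\\s-*#.*$")
+       ;; @r{ADD-TO-RESULT is 1: take the text of subexpression 1.}
+       (prog1 (match-string 1)
+         ;; @r{ADVANCE is t, that is 0: move to the end of the match.}
+         (goto-char (match-end 0))))))
+      @result{} "Tars Tarkas"
+ @end group
+ @end lisp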
+ 
+ As an example, suppose we have a silly database file in which we store
+ information about persons, animals and text editors.  Now we want to
+ extract the names of persons by consecutive calls to a function
+ @code{read-person-name}; that is, the function should skip the records
+ for animals and text editors as well as records for persons where the
+ name field is omitted.  The entries could look like this:
+ 
+ @example
+ @group
+ Type Animal
+ Name Frog
+ Colour Green
+ @end group
+ 
+ @group
+ Type Person
+ # name unknown
+ Colour Red # colour of hair
+ Number 12345 # postal code
+ @end group
+ 
+ @group
+ Type Text-Editor
+ # The One True Editor
+ Name Emacs
+ Number 21.4 # version
+ @end group
+ 
+ @group
+ Type Person
+ Name Tars Tarkas # This is the name we want.
+ Colour Green
+ @end group
+ 
+ @end example
+ 
+ To simplify things, we assume that the entry for each field is on a
+ line of its own, so we can do the parsing line by line.  Fields may
+ occur in arbitrary order, except for the @samp{type} field, which must
+ be the first field in the record, so that each record starts with a
+ declaration of its @samp{type}.  A @samp{#} starts a comment; empty
+ lines and leading whitespace are legal, but not significant; case is
+ also insignificant.
+ 
+ @lisp
+ @group
+ (defun read-person-name ()
+   "Read the next name in the category \"Person\"."
+   (let ((case-fold-search t)) ; ignore case
+     (run-state-machine ()
+       (look-for-person
+        ("\\s-*type\\s-+person" nil forward-line get-name) ; @r{person found}
+        (t nil forward-line look-for-person)) ; @r{keep on searching}
+       (get-name
+        ("\\s-*type" nil nil look-for-person) ; @r{no name was provided}
+        ("\\s-*name\\s-+\\(.*?\\)\\s-*\\(?:#.*\\)?$" 1 forward-line exit) ; @r{name found}
+        (t nil forward-line get-name))))) ; @r{keep on searching}
+ @end group
+ @end lisp
+ 
+ This function parses the text line by line, because @var{advance} is
+ either @code{nil} or the function @code{forward-line}; one effect of
+ this is that all the regular expressions match from the beginning of
+ the line.  When it is in the state @code{look-for-person} the function
+ moves forward in the buffer until it finds a record that starts with
+ @samp{type person}.  Then it switches to the state @code{get-name}.
+ Again it moves point forward line by line until it finds a @samp{name}
+ field.  When it finds one, the state machine terminates and returns
+ the name; this happens in the second transition of @code{get-name}:
+ the regexp is constructed in such a way that characters after
+ @samp{name} and before any @samp{#} match the first subexpression,
+ which is added to the return value, because @var{add-to-result} is
+ @code{1} in this transition.
+ 
+ But when the state machine encounters another @samp{type} field
+ indicating a new record before it finds a name, it switches back to
+ the state @code{look-for-person}.  This is specified in the first
+ transition of @code{get-name}. (@var{advance} is @code{nil} here,
+ because @samp{type} could be @samp{person} again.  Then the state
+ @code{look-for-person} should switch back to @code{get-name}
+ immediately.) So if point is at the beginning of the first record in
+ the example (@samp{Type Animal}), the first call to
+ @code{read-person-name} will return @samp{"Tars Tarkas"}, although
+ there is another person in between.  But this other record provides no
+ name and is therefore ignored.
+ 
+ @node State Machine Notes
+ @subsection Caveats and Notes on Performance
+ 
+ @enumerate
+ 
+ @item
+ Although it is possible to mix transitions that have regular
+ expressions as @var{matcher} with transitions that have characters as
+ @var{matcher} in one and the same state machine, this might add some
+ undesirable overhead, especially if the majority of transitions have
+ characters as @var{matcher}.  In this case it might be worth the extra
+ effort to get rid of regexps entirely for the sake of performance.
+ This is due to the internal handling of the transitions.  As far as
+ this note is concerned, character alternatives and character ranges
+ count as @var{matcher} of the type ``character''.
+ 
+ @item
+ The checking for transitions with regexps as @var{matcher}, characters
+ as @var{matcher} and the default transitions is done independently and
+ in this order: regexps first, defaults last.
+ 
+ @item
+ The order of transitions in the definition of a state is significant.
+ If any two transitions whose @var{matcher} is of the same type would
+ match at point in the current buffer, the state machine chooses the
+ transition that comes first in the state definition.  For example, if
+ one transition has @code{?y} as @var{matcher} and another transition
+ has @code{(?a . ?z)} as @var{matcher} and the character after point is
+ a @samp{y}, then the transition that comes first in the definition of
+ the state is chosen.  @emph{But}: regexps as @var{matcher} always come
+ before characters as @var{matcher} and defaults always come last,
+ regardless of where they were defined.
+ 
+ @item
+ The macro @code{run-state-machine} performs a considerable amount of
+ computation each time it is expanded.  Therefore it is strongly
+ recommended to byte-compile any package that uses it.
+ 
+ @end enumerate