gnu-arch-users
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Gnu-arch-users] arch roadmap 1 (and "what's tom up to")


From: Tom Lord
Subject: [Gnu-arch-users] arch roadmap 1 (and "what's tom up to")
Date: Mon, 28 Jun 2004 12:55:09 -0700 (PDT)

This is the first of three messages, the third of which will be 
delayed a little bit.

The messages begin to introduce furth, a new extension language that I
plan to add to tla (and use for other purposes as well).   They are:


        1. First Furth intro (for C hackers)
           What is furth and how does it work?

        2. First Furth intro (in general)
           What is furth and what does it look like?

        3. Arch Roadmap 
           Where's arch going and where does furth fit in.




                         Furth for C Hackers


  Furth is a new virtual machine, designed to be linked with other
  programs as an extension language, and to be used independently as a
  scripting language.

  The machine language of the Furth VM is, in its own right, a high
  level programming language: an alternative to Scheme, Python, Lua,
  Ruby, Perl, Java, etc. albeit one that feels "lower level" (and more
  general) than any of those.

  The name pays homage to Forth (obviously) and Scheme and Lisp.
  Scheme because many Scheme implementations have rabbit-themed names
  and here we have skinned some of those rabbits to produce FURs.
  Lisp, because furS, pronounced with a lisp.....  Finally, the name
  pays homage to furth itself because discovering fortuitous
  opportunities for parsimony in expression is kind of a design theme
  with furth.

  This is a quick introduction to what Furth is and how it works and
  why, from the point of view of a C programmer wanting to include
  furth in his programs.   A separate message introduces the langauge
  for Furth programmers, leaving C out of it.



* Allocating VMs

  To run furth code, you allocate a (reference-counted) instance of
  a furth VM:

        vm = fur_make_vm (0);


* Cycles, Primtive Operations, Writing Extensions 101

  To execute a furth program, you set up the registers of the 
  VM (described below) and then ask the VM to execute for
  some number of cycles:

        fur_run_vm (vm, 10);            /* run the vm for 10 cycles */

  Each cycle consists of the execution of a single 
  "primitive command".   Commands perform a high-level, 
  atomic transformation on the state of the VM and then
  the cycle ends.

  Furth virtual machines have a fixed set of _registers_.  One of
  these registers is called "$inx", the instruction register.  At the
  beginning of a cycle, $inx contains an instruction which designates
  what primitive instruction to run during that cycle.

  All commands accept a kind of (special purpose) parameter called 
  the "continuation" of the command.   The continuation is a
  representation of "the rest of the computation": of what to do after
  the command completes.   The continuation is passed in a second
  register, called "$next", the next instruction register.

  Most commands, as part of their transformation to the VM state, move
  the value in $next to $inx.

  So, now, what does `fur_run_vm' do?  [The following description is
  ripped off (and hopefully not too badly mangled) from Simon
  Peyton-Jones description of the Spineless, Tagless, G-Machine, a
  virtual machine for the Haskell programming language.]

  In an ideal world (where we didn't have to use C to implement Furth)
  `fur_run_vm' might be this (in pseudo-assembly code):


        ld  $cycles, <n-cycles>         # $cycles := <n-cycles>
        jmp $inx                        # jump through $inx

   Normal primitive commands would have the general form:

        command:        [...do stuf...]         # atomic update to the
                                                # VM state, usually
                                                # including:

                        ld $inx, $next          # $inx := $next

                        ld $next, :halt         # store the "halt" primitive
                                                # in $next

                        djz $cycles, all_done   # decrement $cycles,
                                                # jmp on 0

                        jmp $inx                # more cycles: jmp to
                                                 next 

        all_done:       ret                     # no more cycles, return
                                                # from fur_run_vm

  Unfortunately, such jumps can not be directly implemented in
  portable C.  Fortunately, they can be simulated using a loop.  The
  real implementation of `fur_run_vm' is closer to this:


        while (cycles-- > 0)
        {
          $inx->fn (vm);
        }

   and normal primitive commands have the general form:

       void
       command (t_fur_vm vm)
       {
          [...do stuff...];
          
          vm->$inx = vm->$next;         /* usually */
          vm->$next = :halt;
       }
  
  So the bulk of the work of adding a new primitive to Furth is quite
  trivial: just write a function of the above form (but using real
  code, not pseudo-code).   Piece of cake, really.


* Multi-tasking and Concurrent Execution

  Primitive commands are, by convention, "short-running" routines
  that allocate only a modest amount of memory.   It is fundamental to
  furth that primtive commands normally are _atomic_ meaning that
  no two of them run concurrently.

  A furth VM can, in fact, have more than one set of registers: it 
  can have an array of registers.   This is used for multi-tasking
  by changing `fur_run_vm' a little bit.   For example, here is
  a round-robin scheduler implementation of `fur_run_vm':


        int which_cpu = 0;

        while (cycles-- > 0)
        {
          select_registers_for (vm, which_cpu);

          $inx->fn (vm);

          which_cpu = ((which_cpu + 1) % n_cpus);
        }

  That scheduler performs a context switch ("select_registers_for") at
  the start of every cycle.   More interesting schedulers are also
  possible, of course.

  A furth implementation hosted on a truly multi-headed machine
  can run multiple instances of `fur_run_vm' concurrently 
  with some brute-force locking:

        int which_cpu = 0;

        while (cycles-- > 0)
        {
          critical_section(vm)    /* exclude other threads from this */
            {
               select_registers_for (vm, which_cpu);
               $inx->fn (vm);
            }

          which_cpu = ((which_cpu + 1) % n_cpus);
        }

  Furth commands are normally very high level operations yet also
  fairly short-running.  So, if that is all the locking actually
  needed, the above loop may not be quite as crazy as it might look
  at first glance: it's an easy to reason about semantic, it
  eliminates most need for locking within the primitive command
  implementations,  and while it does force serialization of command
  execution even given N real CPUs, properly coded, it should still
  return a performance benefit approaching that of an N-fold increase
  in cache sizes for a single real CPU --- nothing to sneeze at.


* Values and Locations

  We've seen two registers so far, $inx and $next.   Registers are
  an example of a _location_.   A location holds a _value_.

  A primitive command implementation that needs the Furth value for 
  the integer 42 might contain code like:

        t_fur_location loc42;

        fur_init_location (&loc42, vm);
        fur_make_int32 (&loc42, vm, 42);

        ....

        fur_clear_location (&loc42, vm);


  All values in Furth have the same abstract form.   Essentially, 
  all values have these parts, illustrated with pseudo-C:

     struct value
     {
        type_vtable     type;
        raw_data        uid;                    /* internal binary data */
        unsigned int    is_mutable:1;
        unsigned long   hash_value;
        dictionary      slots;
        struct value *  parent;
     };

  Thus you can ask the hash value for a given value or you can ask
  what it's "representation type" is.  The mutability of some values
  can change over time as can the slots and parent.  The hash value
  and type are fixed.

  The `slots' field functions similarly to array elements or structure
  fields.  Each slot has a name and a current binding.  If the value
  is mutable, bindings can be changed, added, and deleted.  Slot names
  and bindings are arbitrary values: any value may be used as either a
  slot name or a slot binding.

  The `parent' field "points to" (not quite literally) another value
  from which slots can be inherited (using a single-inheritence model).

  The `uid' encodes any data specific to an instance of a more general
  type, but not representable as a slot binding.   For example, all 
  small integer numbers are given the same `type' but each has a 
  distinct `uid' which you can think of as containing the binary
  form of the integer.

  The actual implementation of values, even in a naive implementation, 
  must optimize some cases.   For example, the slots whose names
  are the non-negative integers 0, 1, 2, etc. aren't stored in a
  `dictionary' in the conventional sense, but as a flat array.

  Some values are special because they are immutable and have no
  slots.    An example is the integer 42 which might be described
  in pseudo-code as:

        the_integer_value_42 = { type = integer_type,
                                 uid = "42",
                                 is_mutable = 0,
                                 hash_value = hash_of(42),
                                 slots = 0,
                                 parent = 0 }

  Implementations should take care to optimize the representation
  of such values.

  Because of all the needed optimizations, the real representation of
  values is fairly complicated and different implementations are
  possible.  Therefore, C programs using Furth internally never
  manipulate values directly -- never touch their fields, for example.


v** Standard, Total, Projections

  Every value, regardless of its type, may be converted by C programs
  into any of these types:

        t_int32         /* signed 32 bit integer */
        t_uint32        /* unsigned 32 bit integer */
        double
        void *          /* user-defined extension type */

   however, the conversion is not always meaningful.   For example,
   the `double' value of a "list" value is likely to just be 0.0.

   Still, if your code recognizes the type of the value, then 
   you'll know which conversions are meaningful.


** Array Like and Stack Like Values

  It was mentioned earlier that the slot names 0, 1, 2, etc. are
  treated specially by implementations:  those slots are stored
  as a flat array.

  If a value is mutable, the size of the array can be changed 
  dynamically.

  Therefore, mutable values can be regarded as (among other things)
  resizable arrays.  They can be used as stacks, as can be seen with
  the next register to be introduced (the $tuple register, just
  below).


** Every Value is a Command

  The register $inx is just a general purpose location.   _Any_ value
  can be stored there.   Therefore, _every_ value must have some
  interpretation when executed as a command and _every_ command must
  be representable as a value.



* The $tuple Register

  The tuple register normally contains a mutable value being used
  as a stack.    Some primitive commands manipulate that stack.

  As an example, small integer values are primitive commands which
  push themselves on the $tuple stack.

  In other words, if we set:


        $tuple := <empty-stack>
        $inx   := 42
        $next  := <X>

   then when the primitive command "42" is executed, the outcome is:

        $tuple := 42            ;; one element stack containing "42"
        $inx := <X>
        $next := :halt

   42 has pushed itself onthe stack.

  (Many obvious primitive commands are thus implied, e.g., for the
  arithmetic operators.)


* The $env Register

  All values act like dictionaries using arbitrary values as keys and
  bindings within that dictionary.   Bindings can be inherited in the
  usual single-threaded way from a distinguished "parent value".

  Furth uses this mechanism for many purposes, including to provide a
  mechanism for "object", "closures", and "local variables".

  The register $env normally contains a mutable value used for
  (depending on how you think about programming) the "this" object or
  the current lexical context (local variable set).  There are
  primitive commands to create and destroy bindings in the environment
  as well as to reference and set those bindings.

  Furth environments are fully dynamic: an update to a parent
  environment effectively propogates to children environments.
  Looking up, modifying, defining or deleting bindings can therefore
  be a fairly heavyweight operation and Furth implementations are 
  encouraged to optimize it as far as practical.


* The $env Register

  All values act like dictionaries using arbitrary values as keys and
  bindings within that dictionary.   Bindings can be inherited in the
  usual single-threaded way from a distinguished "parent value".

  Furth uses this mechanism for many purposes, including to provide a
  mechanism for "object", "closures", and "local variables".

  The register $env normally contains a mutable value used for
  (depending on how you think about programming) the "this" object or
  the current lexical context (local variable set).  There are
  primitive commands to create and destroy bindings in the environment
  as well as to reference and set those bindings.

  Furth environments are fully dynamic: an update to a parent
  environment effectively propogates to children environments.
  Looking up, modifying, defining or deleting bindings can therefore
  be a fairly heavyweight operation and Furth implementations are 
  encouraged to optimize it as far as practical.


* The $root Register

  The $root register is a second environment, like $env.  It serves as
  a kind of "blackboard" via which commands can communicate
  asynchronously and anonymously, without having to pass parameters
  through to one another explicitly.


* Some Standard Types

  We've already mentioned standard types for small integers.  In fact,
  Furth also supports double precision floating point numbers and
  eventually a complete Scheme-like numeric tower (but extensible).
  Each numeric type is its own primitive type but a more general
  "meta"-type for numbers is also defined:

        int fur_is_number (t_fur_vm vm, t_fur_location * it);

  along with friendly conversion functions to and from C's numeric
  types.

  We also have a primitive type, keywords, used to represent many
  primitive commands.   Keywords are written as a colon followed by 
  a keyword name.    Two examples are the commands:


        :halt           -- exit the virtual machine with no error
        :hcf            -- exit the virtual machine with an error indication


  One of the more interesting standard types is the _cons_pair_:

  A cons pair is a two-element array of a distinct type, structurally
  immutable but (usually) with mutable bindings.   In other words, you 
  can reference or set slots 0 and 1 of a cons pair, but you can't add
  or subtract any other fields.

  Following lisp, we can write a cons pair like this:

        (a . b)         -- element 0 is a, element 1 is b

   and we can abbreviate a nested list of pairs:

        (a . (b . (c . (d . e))))

   using the more succinct syntax:

        (a b c d . e)

   All values are primitive commands and pair values are no exception.
   Suppose that at the beginning of a cycle, the virtual machine
   has register values:


        $inx = (a . b)
        $next = <X>

   (we don't care about $tuple, $env, or $global).   Then when the
   "cons pair" command executes, it modifies the VM state to be:

        $inx = a
        $next = (b. <X>)

   As a special exception to that rule, if the initial state is:

        $inx = (a . b)
        $next = :hcf

   then the final state is:

        $inx = a
        $next = b

   You can trace through the effect of a run applying that rule and
   assuming that both a and b end with:

        $inx := $next
        $next := :hcf

   the complete trace is:

        $inx = (a . b)
        $next = <X>

      => (run the "cons pair" primitive)

        $inx = a
        $next = (b . <X>)

      => (run the "a" primitive)         

        $next = (b . <X>)
        $next = :hcf

      => (run the "cons pair" primitive)

        $next = b
        $next = <X>

      => (run the "b" primitive)

        $next = <X>
        $next = :hcf

      => (run the "<X>" primitive)

        [...]

  The careful observer will note that the above trace uses only a 
  constant amount of space and it correctly does what we hoped:
  it runs "a", then "b", then "<X>".

  On that basis, one can imagine other kinds of value that might
  be of use.   For example, suppose that in addition to ordinary
  cons pairs, we have "conditional cons pairs" which we can 
  write:

        :(if a b)               ; a is elt 0, b is elt 1
                                ; of this "conditional cons pair"

  What happens if a conditional cons pair is exectued as an
  instruction?  Two rules apply:


    + general case

        $inx = :(if a b)
        $next = <X>
        $tuple = ... <Y>        ; any stack with <Y> on top

      =>

        $inx = a                ; execute a, ignore b
        $next = <X>
        $tuple = ...            ; have popped the <Y>

    + special case


        $inx = :(if a b)
        $next = <X>
        $tuple = ... #f        ; any stack with #f ("false") on top

      =>

        $inx = b                ; execute _b_, ignore a
        $next = <X>
        $tuple = ...            ; have popped the #f

  Suppose that we have a simple furth interpreter that reads commands
  one after the other, places each in the $inx register (and suitably
  arranges the other registers), executes the command and repeats the
  loop.  Putting together the stack-like nature of $tuple, the
  self-pushing nature numbers, the availability of basic arithmetic
  and stack operations, and cons pairs, and conditional cons pairs, we
  can say what such an interpreter does given the input:


        (:dup 0 < :(if -1 (:dup 0 > :(if 1 0))) . :print)

  A trace begins:

        $inx = (:dup 0 < :(if -1 (:dup 0 > :(if 1 0))) . :print)
        $next = <X>
        $tuple = .... <Y>
      =>
        $inx = :dup
        $next = ((0 < :(if -1 (:dup 0 > :(if 1 0))) . :print) . <X>)
        $tuple = .... <Y>
      =>
        $inx = ((0 < :(if -1 (:dup 0 > :(if 1 0))) . :print) . <X>)
        $next = :hcf
        $tuple = .... <Y> <Y>
      =>
        etc

  and the program ultimate prints the value of the "signum" function
  for the number found on top of the stack upon entry.


* Variable Capture

  Suppose that, upon entry to a command, the tuple stack contains
  values:

        $tuple = a b c d .... z

  and $env has:

        $env = <X>

  A primitive command, called a "capture command", can convert
  those anonymous stack values into named local variables:

        $tuple = a b c d ... z
        $env = <X>
        $inx = :(capture name0 name1 .... nameN)
        $next = <Y>

   =>

        $tuple = <empty>
        $env = #(name0 => a
                 name1 => b
                 ....
                 nameN => z
                 parent => <X>)
        $inx = <Y>
        $next = :hcf


  Such variables can be referenced and set:

        :ref definition                 :set definition
        ---------------                 ---------------

        $tuple = <empty>                $tuple = <Z>
        $env = #(name0 => a
                 name1 => b             ...
                 ....             
                 nameN => z
                 parent => <X>)
        $inx = :(ref name1)             $inx = :(set name1)
        $next = <Y>                     ...
      =>
        $tuple = b                      $tuple = <empty>
        $env = #(name0 => a ...)        $env = #(name0 => a name1 => <Z> ...)
        $inx = <Y>                      ...
        $next = :hcf

  Using :capture, :ref, :set, and related operations, furth programs
  can implement:

        ~ high-level calling conventions
        ~ creation of lexical closures
        ~ instantiation of object classes



* Applications and Possibilities

  C programs embedding Furth or being loaded into a stand-alone furth
  can add new primitive commands and, thus, new data types.   That's a 
  very general framework and one that's easy to extend.

  At one end of a spectrum, it would be easy to write primitive furth
  commands to read one line of input, create a string value from that,
  and push it on the tuple stack.   It would be easy to add commands
  to compare that string to a regular expression, perhaps edit it, 
  and so forth.    Combined with very simple flow of control and a
  use-by-convention of $env as a "hold register", Furth would become
  a virtual machine for the programming language of "/bin/sed".

  Near that end of the spectrum, a few simple types and a choice of
  high-level coding convention would turn Furth into awk, with simple
  function definitions, global and local variables, the awk i/o and
  flow-control model, etc.

  Towards the far end of the spectrum: The Furth VM is a trivial
  target for a (fully standard) Scheme compiler as well as for a
  compiler for a Python-like language sufficiently close to Python to
  reveal that the remaining differences are over fairly arbitrary and
  insigificant details.

  And very far along: one could add primitive commands to Furth which
  implement graph-reduction for supercombinators --- in other words,
  the trivial loop which comprises the furth inner interpreter, plus 
  the fact that we have a few registers to play with, plus the fact
  that we have a simple, homogenous, fully general yet optimizal data
  type (the "value"): those add up to to a framework which could be
  extended into being a respectable implementation of Haskell and
  similar languages.

  Another idea is to add "virtual coprocessors" to furth.  For
  example, one could add a primitive command whose effect is to
  execute a short "sub-program" (or "macro instruction") as
  interpreted by, say, an interpreter that can do fast arithmetic on
  unboxed integers and floats.  Furth is minimal enough that such a
  coprocessor can fit in non-disruptively.  Furth's single-typed
  domain of values reduces the issue to a tractable problem of cleanly
  exchanging data between furth proper and the coprocessor.

  The Furth API (and, for that matter, the entire reference
  implementation) are _not_ built for maximal speed.  Rather, they are
  built for exceptional simplicity.  Furth could be extended to be a
  decent Haskell, but it will be a very long time (not forever,
  necessarily) before Furth will ever be a
  Haskell-to-compete-against-optimized-haskells.   Neverthless, it
  certainly can be fast enough to be very useful and I'm gambling it
  can be fast enough to keep me busy for a very long time to come.

  The real question, in my mind, is how to glue various programming
  idioms and the languages they suggest together:

  For example, let's suppose that I put Scheme on top of the furth VM
  and do most of my programming in Scheme  (a silly idea, but suppose
  I do it anyway).   Someone else puts "/bin/sed" on furth.

  Can I put those together?  Can I call Scheme from sed and sed from
  Scheme?  If I can, that's likely to be a huge win.  For example, on
  the one hand, it is a lot easier to optimize sed-on-furth than
  Scheme-on-furth.  So even if my Scheme is a little slow, the calls
  to the sed-like routines can be very fast.  On the other hand, it
  can be a lot _clearer_ to write some program-parts in sed than to
  write those same parts in Scheme.  There are many things a sed-like
  language can't express well or even at all, but for those things it
  can express well, it can be far clearer, simpler, and more
  maintainable than just about any alternative language.

  Furth is designed to encourage that kind of combination.  Aside
  from the simple excution model (the loop that continually calls
  the command pointed at by $inx), the uniform abstraction for 
  all values, along with emphasis on a few universally available
  general purpose value types (like numbers and cons pairs) is
  intended to help keep the coupling between different programming
  language extensions to furth minimal and tractable.


* Making it Fast

  Some optimizations have already been hinted at.   For example,
  while a "general value" needs to be heap allocated and has
  fairly complicated structure, a small constant integer doesn't
  need to be heap allocated.

  The lookup of names in environments such as $env is likely to be 
  a performance-critical operation.   By allowing environments to be
  modified on the fly, Furth places an upper-bound on the (general
  case) amount of optimization that is possible.

  In the reference implementation, I plan to use a large hash table
  to speed up such references.   If you ask:

        fur_ref (&answer, vm, &dictionary, &key)

  the hash values of the dictionary and key will be combined and 
  looked up in a table.   If present, that resolves the reference
  in O(1) steps --- still involving a hash computation and table
  lookup, but at least O(1).    There is some trickiness to
  maintaining the hash table accurately as environments are modified.

  Beyond that, the philosophy of Furth is that a general purpose
  programming language should be "dynamic in general, optimizable in
  particular".    In other words, it is best to not restrict the
  language by removing dynamic features (like environment
  modification) even though that thwarts general-case optimizations
  just so long as, at the same time, subsets of the language which
  clearly avoid the dynamic features can be strongly optimized.


* Summary:

  Registers:

        $inx -- what command to run on the next cycle
        $next -- continuation parameter to that command
        $env -- the local variables
        $global -- the shared "blackboard"
        $tuple -- parameters (aka return values) passed between commands

  Types:
        All values are of the same general type which is a type-tagged
        dictionary with a distinct identity.   Constants like "42" are
        immutable, (essentially) empty dictionaries whose "distinct
        identity" captures the "42-ness of 42".   

        General, mutable objects function as "environments", with
        single inheritence of bindings from another environment.

        Within an environment-like value, the binding names 0, 1, 2
        etc. are treated specially:  they are fast to access, making
        values seem "array-like" or "stack-like".

  Execution:
        All values are also commands.  Most, like numbers, just push
        themselves on the data stack ($tuple).   Some, like cons pairs
        and conditional cons pairs are "composite values" used to 
        implement flow of control.

  Base Language:
       The base Furth language just reads and executes commands, one
       by one.   Thus, the program:

           2 2 :+ :print

       executes four commands; the first two each push a number on the
       stack; the third replaces those with their sum;  the fourth
       prints the result.   In a program like:

          (2 2 :+) print

       there are only two commands but the first one:

          (2 2 :+)

       is a composite command that may take many cycles to complete.


* What's Left Out So Far and Future Directions

  After I finish up the core VM implementation, the next thing is to 
  define the set of primitive commands which are guaranteed to be 
  there.

  Aside from obvious things like basic data types (numbers,
  characters, strings, lists, etc.) I intend to define a minimal
  run-time system with support for I/O, lexical analysis, parsing, and
  minimal debugging facilities.   As in Forth, these will be used to 
  define a semi-standard "outer-interpreter" or "repl":  a standard
  (extensible) program which interactively (or from a file) reads and
  executes commands, one by one.

-t


;;; arch-tag: Tom Lord Mon Jun 28 11:34:49 2004 (furth/README.C-perspective)


----

Like my work on GNU arch, Pika Scheme, and other technical contributions 
to the public sphere?   Show your support!

https://www.paypal.com/xclick/business=lord%40emf.net&item_name=support+for+arch+and+other+free+software+efforts+by+tom+lord&no_note=1&tax=0&currency_code=USD

and

address@hidden for www.moneybookers.com payments.


-----

  "Hello, all of you boys and girls,
   I'd like to take you to the inside world."
                           - Claypool





reply via email to

[Prev in Thread] Current Thread [Next in Thread]