From: Thomas Lord
Subject: [Gnu-arch-users] a new project
Date: Mon, 10 Oct 2005 01:33:14 +0000
One way to help advance the cause of distributed revision control
is to facilitate a greater number of applications that can make good
use of it. In that direction:
Contents:
* The Goal: "Human Friendly Application Documents" (HFADs)
** What is an HFAD?
** The Most Well Known Example of an HFAD
** Envisioned Applications
** The Technical Goals
* Solution Theory -- LALR(1) Recursive Decomposition Parsing
** Illustrations
** Performance
* Implementation Strategy
* How to Help
* The Goal: "Human Friendly Application Documents" (HFADs)
The goal of this project is to create software tools for implementing
"human friendly application documents" (HFADs).
** What is an HFAD?
An HFAD is a plain-text format for structured data, designed
for
~ readability and editability
The syntax should have a natural appearance -- the structured
data should be presentable as one human might present it
to another, even though it must also be readable by programs.
It should be easy to learn to create or edit data files in the
syntax using an ordinary text editor such as Emacs or vi.
For example, the syntax should involve the minimum achievable
amount of markup, and what markup there is should resemble how
one might write the same information on paper, for human
consumption.
~ good error handling
Syntax errors should be handled well. Where possible, errors
should not "cascade": if a subsection of a file is bogus, later
sections should still be parsable. This matters most for the
kinds of syntax errors users might make while typing and the
kinds that merge tools might produce.
Syntax errors should be easy for users to spot by eye and
correct, with minimal training. To an extent, this is
achieved by "readability and editability".
In most applications, if a part of the document contains a
syntax error, the application should be able to recover by
showing the user the plain-text content of that section. Either
the plain-text format is "useful enough" directly or the user
can correct it using a generic text editor.
~ good branching and merging using generic tools
We envision "teamware" applications wherein multiple users
copy and separately modify a single data file, and then
want to somehow combine those changes to produce a unified
version.
For simplicity (to application programmers and users) we would
like the technology for merging to be generic: to work the same
way for most or all applications. We anticipate using
`diff/patch'-style or `diff3'-style merging for this purpose.
Such generic merging tools can be relied upon to sometimes
corrupt files, since a perfect merge is not always possible.
(This problem is an intrinsic part of merging generally, not of
`diff/patch/diff3' specifically.)
This is one reason to want a syntax that supports good error
handling: applications using the syntax should be able to
assist users in resolving merge conflicts using generic
tools.
** The Most Well Known Example of an HFAD
Wiki syntaxes are the best known example of an HFAD.
** Envisioned Applications
Beyond simple wikis, we envision applications such as:
~ more general word processors
We would like to provide a new style of tool for the
kinds of user needs currently satisfied by applications
such as OpenOffice.
~ accounting, tabulating, and database-reporting applications
We imagine applications that fill the user need of
spreadsheets but that are different in form.
In particular we plan for multi-media documents that
happen to contain active elements such as variable
definitions, equations, graphing specifications,
query specifications, and tables of these.
Apart from how the functionality is developed, it is not
initially obvious that this is really a separate application
from the more general word processors described above.
~ calendars, contact databases, etc.
The full range of PIM applications.
~ literate programming systems
We envision pre-processors for programming language
source text that process input containing a mixture of
source code and documentation, with a strong emphasis
on documentation. Knuth's CWEB and similar systems
give a pleasing printed representation but are agony
to read in source form. Systems such as Docbook leave
the source looking reasonably conventional but have
limited capabilities to restructure source for presentation
or to do much more than document an API.
We anticipate plain-text, text-editor-style interfaces
to these applications, fully GUI interfaces, and hybrids
(such as "edit in text, see in GUI"). Interesting possibilities
exist for "web browser based" applications and "[x]emacs based"
applications.
We plan for all applications to be "teamware-ready", meaning
that collaboration on data files, with branching and merging
of efforts, will be a fundamental capability of all of them.
We believe that HFAD formats, especially for static data (such
as simple documents), are ideal for use on very minimal clients.
Plain-text formats are useful on simple terminals and on hand-held
devices with very minimal display and input capabilities.
** The Technical Goals
HFAD syntaxes should be formally specified, in a machine readable
format, from which a parser can be generated. Such specifications
are also useful for the purpose of formal interoperability
standards.
In the common case, the parser generated from a syntax specification
should simply generate a tree using a standard data structure (such
as any of the DOM APIs). The tools should make this common case
easy to achieve.
Programmers should be able to specify alternative actions for the
parser to take. Instead of generating a standard tree, a parser
might generate a different data structure or simply directly
interpret input.
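As a rough sketch of the intent -- using the `<...>' terminal
notation of the illustrations below, Bison-style actions, and a
hypothetical tree-building helper `hfad_node' -- the common case
and an alternative might look like:
    /* Common case: the action builds a generic tree node
       (`hfad_node' is a hypothetical helper).  */
    section: <title> <paragraphs>
                { $$ = hfad_node ("section", 2, $1, $2); } ;

    /* Alternative: the action interprets the input directly.  */
    sum: <number> `+' <number>
                { $$ = $1 + $3; } ;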
The expressive range of HFAD syntaxes should be no less than that
of LALR(1) grammars.
It should be easier to define and maintain a new HFAD syntax
using these tools than to write parsers "by hand".
* Solution Theory -- LALR(1) Recursive Decomposition Parsing
We define an HFAD syntax to consist of a set of LALR(1) grammars,
related in a hierarchy.
Each terminal symbol in one of the grammars stands for two things: a
class of lexemes which correspond to that terminal symbol (as
usual), and a sub-grammar (which is likely to have its own, separate
lexical syntax) used to further parse the lexeme.
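For illustration only (no such specification language exists
yet, so this notation is invented here), a syntax built from
three grammars might be declared along these lines:
    grammar document                ; top-level grammar and lexer
      terminal <section>  => grammar section
      terminal <error>

    grammar section                 ; has its own lexical syntax
      terminal <equation> => grammar equation

    grammar equation                ; innermost level
Each `terminal' line names a class of lexemes in the parent
grammar and, optionally, the sub-grammar used to re-parse the
text of each such lexeme.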
Using the highest-level grammar and lexical syntax, parsing
begins in the ordinary, left-to-right, single-pass way.
When a production containing a terminal symbol is reduced, each
subtext corresponding to a terminal is re-parsed using the
sub-grammar corresponding to that terminal. Parameters (e.g.,
the indentation level) may be generated by the lexical analyzer
that matched the upper-level terminal and passed on to the
subordinate lexical analyzer. This process continues recursively
for all terminals encountered for which sub-grammars are defined.
If a grammar at a particular level fails to parse the text it is
handed, the corresponding token in the next higher grammar is
converted to the special terminal `<error>'.
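A minimal sketch, in C, of that recursive step (every name here
is hypothetical, invented for illustration):
    /* Re-parse the text of a lexeme using the sub-grammar bound
       to its terminal symbol.  Called as productions containing
       that terminal are reduced.  PARAMS carries values (e.g.
       the indentation level) recorded by the lexer that matched
       the upper-level lexeme.  */
    static hfad_tree *
    parse_subtext (hfad_grammar * sub,
                   const char * text, size_t len,
                   hfad_lex_params * params)
    {
      hfad_tree * t = hfad_parse (sub, text, len, params);
      if (!t)
        /* The sub-grammar failed to parse the text: hand it back
           to the level above as the special `<error>' token,
           keeping the raw text so an application can display or
           re-edit it.  */
        return hfad_error_token (text, len);
      return t;
    }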
We are not certain, yet, how to characterize what expressive power
we gain (if any) compared to ordinary LALR(1) grammars.
Regardless, this approach is a convenient mode of expression:
** Illustrations
At one level, a grammar for a form of document (as for a conference
paper) may contain the fragment:
body: /* empty */
    | body <section>
    | body <error>
Terminals are indicated in these examples by `<...>' quoting.
The document syntax must provide a separate grammar for `<section>'.
There are three advantages to using a separate grammar and a
separate parsing pass to deconstruct each `<section>'.
First, the broad structure of the document can be recognized and
handled regardless of any syntax errors in a particular section.
Assuming that the broad lexical syntax of sections is correct,
a section containing an error is converted to an `<error>' token.
On a web page, for example, most sections might be converted to
nicely formatted hypertext while the errant section is displayed
in source form.
Second, the lexical analysis of each subsection can be parameterized
by the lexical analysis of the section overall. For example, if the
grammar for `body' finds that a section is indented five spaces, the
subordinate lexical analyzer for `<section>' can take that into
account.
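For instance (with hypothetical names again), the subordinate
lexer might simply discount the enclosing indentation before
tokenizing each line:
    static int
    section_lex (hfad_token * tok,
                 const char * line,
                 hfad_lex_params * params)
    {
      int indent = params->indent;   /* e.g. 5, as measured by
                                        the lexer for `body' */
      while ((indent > 0) && (*line == ' '))
        {
          --indent;
          ++line;
        }
      return next_token (tok, line); /* then tokenize as usual */
    }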
Third, it is extremely convenient to apply entirely unrelated
lexical analyzers at different levels of parse. This can be
illustrated by considering a hypothetical wiki syntax with
support for embedded mathematical equations:
Joe and/or Sally computed $ (a^2 + b^2)/(c^2 + d^2) $
In its first occurrence ("and/or"), the `/' in that fragment is
not usefully a separate lexeme. In its second occurrence (inside
the equation), it almost certainly is. Here we would expect a
higher-level grammar containing a fragment such as:
text_block: [...]
          | text_block <equation>
where the terminal `<equation>' is, say, text delimited by `$'
characters.
In the recursive parse of the equation, an entirely separate lexer
is used -- one which can treat `/' very differently.
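Sketched as a pair of hand-coded lexers in C (function and token
names hypothetical), the difference is only in which lexer is
consulted:
    #include <stddef.h>
    #include <string.h>

    enum token_kind
      { TOK_TEXT, TOK_EQUATION, TOK_DIVIDE, TOK_ATOM, TOK_END };

    /* Top-level lexer: everything between `$' delimiters becomes
       a single `<equation>' lexeme; `/' is ordinary text here.  */
    enum token_kind
    toplevel_lex (const char ** p, const char ** text, size_t * len)
    {
      if (!**p)
        return TOK_END;
      if (**p == '$')
        {
          const char * start = *p + 1;
          const char * end = strchr (start, '$');
          if (!end)
            return TOK_END;    /* unterminated: a real lexer would
                                  produce an `<error>' here */
          *text = start;
          *len = end - start;
          *p = end + 1;
          return TOK_EQUATION;
        }
      *text = *p;
      *len = strcspn (*p, "$");      /* plain text up to next `$' */
      *p += *len;
      return TOK_TEXT;
    }

    /* Equation lexer, applied recursively to an `<equation>'
       subtext: here `/' is an operator in its own right.  */
    enum token_kind
    equation_lex (const char ** p)
    {
      while (**p == ' ')
        ++*p;
      if (!**p)
        return TOK_END;
      if (**p == '/')
        {
          ++*p;
          return TOK_DIVIDE;
        }
      ++*p;                          /* anything else: one atom per
                                        character, for brevity */
      return TOK_ATOM;
    }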
** Performance
This describes a multi-pass parsing technique whose worst-case
performance is `O(N * L)', where `N' is the length of the input
text and `L' is the number of grammars involved in the syntax
specification: each grammar level scans a given character of the
input at most once. We generally expect `L' to be small and the
worst case to occur rarely.
For many practical applications, therefore, we expect the
performance trade-offs compared to conventional LALR(1) parsing
to be well worth the gains in ease of specification and
excellent error handling.
For wiki applications specifically, we expect this approach to
be a large gain *if* some care is taken in writing lexical
analyzers for the various levels. Today's wikis generally use
ad-hoc, multi-pass approaches making heavy use of dynamically
compiled regexps. We are certainly unlikely to do worse than
that, performance-wise.
* Implementation Strategy
Conveniently, `GNU Bison' provides features for generating
reentrant parsers, for linking more than one parser into a single
compiled program, and for adding parameters to the parser and
lexer APIs.
We intend, ultimately, to invent a new, yacc-like tiny language
for specifying our HFAD grammars and to implement a translator
which generates Bison input from that source.
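For example, the declarations in one sub-grammar's Bison input
might look roughly like this (the directives are standard Bison;
the parameter types are hypothetical):
    %pure-parser
    %name-prefix="equation_"
    %parse-param { hfad_tree ** result }
    %parse-param { hfad_lex_params * params }
    %lex-param   { hfad_lex_params * params }
yielding a reentrant `equation_parse' that can coexist with the
parsers for the other grammars in the hierarchy and that threads
the lexical parameters through to its lexer.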
For now, we intend to rely mostly on "hand-coded" lexers until we
have a better understanding of what kinds of lexers most commonly
arise in practice. While a `lex' (regular expression based) lexer
may be right for a sub-grammar for mathematical equations, we foresee
higher-level lexers needing to make use of context (e.g. indentation
levels) in ways that don't map onto regular expression lexers in
obvious ways.
A two phase approach is planned for building the parsing tools:
Phase 1: Devise a moderately interesting Wiki-style syntax
for static text to be translated to, at least, HTML and TeX.
Implement a recursive decomposition parser "by hand", writing
Bison input files directly.
Phase 2: With the lessons learned, design a higher-level language
that more directly captures recursive decomposition parsing and
implement a translator from that syntax to Bison input.
* How to Help
If you have theoretical insights or prior-art comparisons, I'd
like to hear them.
Beyond that, the best way to help the project is to help fund
it. If you want to go off and work on it yourself, please
consider rewarding me for putting the idea forward.
I am `address@hidden' for inquiries and also for PayPal gifts.
-t