From: Thomas Lord
Subject: [Gnu-arch-users] a new project
Date: Mon, 10 Oct 2005 01:33:14 +0000
One way to help advance the cause of distributed revision control
is to facilitate a greater number of applications that can make good
use of it. In that direction:
Contents:
* The Goal: "Human Friendly Application Documents" (HFADs)
** What is an HFAD?
** The Most Well Known Example of an HFAD
** Envisioned Applications
** The Technical Goals
* Solution Theory -- LALR(1) Recursive Decomposition Parsing
** Illustrations
** Performance
* Implementation Strategy
* How to Help
* The Goal: "Human Friendly Application Documents" (HFADs)
The goal of this project is to create software tools for implementing
"human friendly application documents" (HFADs).
** What is an HFAD?
An HFAD is a plain-text format for structured data, designed
for
~ readability and editability
The syntax should have a natural appearance -- the structured
data should be presentable as one human might present it
to another, even though it must also be readable by programs.
It should be easy to learn to create or edit data files in the
syntax using an ordinary text editor such as Emacs or vi.
For example, the syntax should involve the minimum achievable
amount of markup, and what markup there is should resemble how
one might write the same information on paper, for human
consumption.
~ good error handling
Syntax errors should be handled well. Where possible, errors
should not "cascade": if a subsection of a file is bogus, later
sections should still be parsable. This matters most for the
kinds of syntax errors users might make while typing and the
kinds that merge tools might produce.
Syntax errors should be easy for users to spot by eye and
correct, with minimal training. To an extent, this is
achieved by "readability and editability".
In most applications, if a part of the document contains a
syntax error, the application should be able to recover by
showing the user the plain-text content of that section. Either
the plain-text format is "useful enough" directly or the user
can correct it using a generic text editor.
~ good branching and merging using generic tools
We envision "teamware" applications wherein multiple users
copy and separately modify a single data file, and then
want to somehow combine those changes to produce a unified
version.
For simplicity (to application programmers and users) we would
like the technology for merging to be generic: to work the same
way for most or all applications. We anticipate using
`diff/patch'-style or `diff3'-style merging for this purpose.
Such generic merging tools can be relied upon to sometimes
corrupt files, since a perfect merge is not always possible.
(This problem is an intrinsic part of merging generally, not of
`diff/patch/diff3' specifically.)
This is one reason to want a syntax that supports good error
handling: applications using the syntax should be able to
assist users in resolving merge conflicts using generic
tools.
** The Most Well Known Example of an HFAD
Wiki syntaxes are the best known example of an HFAD.
** Envisioned Applications
Beyond simple wikis, we envision applications such as:
~ more general word processors
We would like to provide a new style of tool for the
kinds of user needs currently satisfied by applications
such as OpenOffice.
~ accounting, tabulating, and database-reporting applications
We imagine applications that fill the user need of
spreadsheets but that are different in form.
In particular we plan for multi-media documents that
happen to contain active elements such as variable
definitions, equations, graphing specifications,
query specifications, and tables of these.
Apart from how the functionality is developed, it is not
initially obvious that this is really a separate application
from the more general word processors described above.
~ calendars, contact databases, etc.
The full range of PIM applications.
~ literate programming systems
We envision pre-processors for programming language
source text that process input containing a mixture of
source code and documentation, with a strong emphasis
on documentation. Knuth's CWEB and similar systems
give a pleasing printed representation but are agony
to read in source form. Systems such as Docbook leave
the source looking reasonably conventional but have
limited capabilities to restructure source for presentation
or to do much more than document an API.
We anticipate plain-text, text-editor-style interfaces
to these applications, fully GUI interfaces, and hybrids
(such as "edit in text, see in GUI"). Interesting possibilities
exist for "web browser based" applications and "[x]emacs based"
applications.
We plan for all applications to be "teamware-ready", meaning
that collaboration on data files, with branching and merging
of efforts, will be a fundamental capability of all of them.
We believe that HFAD formats, especially for static data (such
as simple documents), are ideal for use on very minimal clients.
Plain-text formats are useful on simple terminals and on hand-held
devices with very minimal display and input capabilities.
** The Technical Goals
HFAD syntaxes should be formally specified, in a machine readable
format, from which a parser can be generated. Such specifications
are also useful for the purpose of formal interoperability
standards.
In the common case, the parser generated from a syntax specification
should simply generate a tree using a standard data structure (such
as any of the DOM APIs). The tools should make this common case
easy to achieve.
Programmers should be able to specify alternative actions for the
parser to take. Instead of generating a standard tree, a parser
might generate a different data structure or simply directly
interpret input.
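As a rough sketch of the intent -- using the `<...>' terminal
notation of the illustrations below, Bison-style actions, and a
hypothetical tree-building helper `hfad_node' -- the common case
and an alternative might look like:
    /* Common case: the action builds a generic tree node
       (`hfad_node' is a hypothetical helper).  */
    section: <title> <paragraphs>
                { $$ = hfad_node ("section", 2, $1, $2); } ;

    /* Alternative: the action interprets the input directly.  */
    sum: <number> `+' <number>
                { $$ = $1 + $3; } ;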
The expressive range of HFAD syntaxes should be no less than that
of LALR(1) grammars.
It should be easier to define and maintain a new HFAD syntax
using these tools than to write parsers "by hand".
* Solution Theory -- LALR(1) Recursive Decomposition Parsing
We define an HFAD syntax to consist of a set of LALR(1) grammars,
related in a hierarchy.
Each terminal symbol in one of the grammars stands for two things: a
class of lexemes which correspond to that terminal symbol (as
usual), and a sub-grammar (which is likely to have its own, separate
lexical syntax) used to further parse the lexeme.
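For illustration only (no such specification language exists
yet, so this notation is invented here), a syntax built from
three grammars might be declared along these lines:
    grammar document                ; top-level grammar and lexer
      terminal <section>  => grammar section
      terminal <error>

    grammar section                 ; has its own lexical syntax
      terminal <equation> => grammar equation

    grammar equation                ; innermost level
Each `terminal' line names a class of lexemes in the parent
grammar and, optionally, the sub-grammar used to re-parse the
text of each such lexeme.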
Using the highest-level grammar and lexical syntax, parsing
begins in the ordinary, left-to-right, single-pass way.
When a production containing a terminal symbol is reduced, each
subtext corresponding to a terminal is re-parsed using the
sub-grammar corresponding to that terminal. Parameters (e.g.,
the indentation level) may be generated by the lexical analyzer
that matched the upper-level terminal and passed on to the
subordinate lexical analyzer. This process continues recursively
for all terminals encountered for which sub-grammars are defined.
If a grammar at a particular level fails to parse the text it is
handed, the corresponding token in the next higher grammar is
converted to the special terminal `<error>'.
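A minimal sketch, in C, of that recursive step (every name here
is hypothetical, invented for illustration):
    /* Re-parse the text of a lexeme using the sub-grammar bound
       to its terminal symbol.  Called as productions containing
       that terminal are reduced.  PARAMS carries values (e.g.
       the indentation level) recorded by the lexer that matched
       the upper-level lexeme.  */
    static hfad_tree *
    parse_subtext (hfad_grammar * sub,
                   const char * text, size_t len,
                   hfad_lex_params * params)
    {
      hfad_tree * t = hfad_parse (sub, text, len, params);
      if (!t)
        /* The sub-grammar failed to parse the text: hand it back
           to the level above as the special `<error>' token,
           keeping the raw text so an application can display or
           re-edit it.  */
        return hfad_error_token (text, len);
      return t;
    }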
We are not certain, yet, how to characterize what expressive power
we gain (if any) compared to ordinary LALR(1) grammars.
Regardless, this approach is a convenient mode of expression:
** Illustrations
At one level, a grammar for a form of document (as for a conference
paper) may contain the fragment:
body: /* empty */
    | body <section>
    | body <error>
Terminals are indicated in these examples by `<...>' quoting.
The document syntax must provide a separate grammar for `<section>'.
There are three advantages to using a separate grammar and a
separate parsing pass to deconstruct each `<section>'.
First, the broad structure of the document can be recognized and
handled regardless of any syntax errors in a particular section.
Assuming that the broad lexical syntax of sections is correct,
a section containing an error is converted to an `<error>' token.
On a web page, for example, most sections might be converted to
nicely formatted hypertext while the errant section is displayed
in source form.
Second, the lexical analysis of each subsection can be parameterized
by the lexical analysis of the section overall. For example, if the
grammar for `body' finds that a section is indented five spaces, the
subordinate lexical analyzer for `<section>' can take that into
account.
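For instance (with hypothetical names again), the subordinate
lexer might simply discount the enclosing indentation before
tokenizing each line:
    static int
    section_lex (hfad_token * tok,
                 const char * line,
                 hfad_lex_params * params)
    {
      int indent = params->indent;   /* e.g. 5, as measured by
                                        the lexer for `body' */
      while ((indent > 0) && (*line == ' '))
        {
          --indent;
          ++line;
        }
      return next_token (tok, line); /* then tokenize as usual */
    }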
Third, it is extremely convenient to apply entirely unrelated
lexical analyzers at different levels of parse. This can be
illustrated by considering a hypothetical wiki syntax with
support for embedded mathematical equations:
Joe and/or Sally computed $ (a^2 + b^2)/(c^2 + d^2) $
In its first occurrence ("and/or"), the `/' in that fragment is
not usefully a separate lexeme. In its second occurrence (inside
the equation), it almost certainly is. Here we would expect a
higher-level grammar containing a fragment such as:
text_block: [...]
          | text_block <equation>
where the terminal `<equation>' is, say, text delimited by `$'
characters.
In the recursive parse of the equation, an entirely separate lexer
is used -- one which can treat `/' very differently.
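Sketched as a pair of hand-coded lexers in C (function and token
names hypothetical), the difference is only in which lexer is
consulted:
    #include <stddef.h>
    #include <string.h>

    enum token_kind
      { TOK_TEXT, TOK_EQUATION, TOK_DIVIDE, TOK_ATOM, TOK_END };

    /* Top-level lexer: everything between `$' delimiters becomes
       a single `<equation>' lexeme; `/' is ordinary text here.  */
    enum token_kind
    toplevel_lex (const char ** p, const char ** text, size_t * len)
    {
      if (!**p)
        return TOK_END;
      if (**p == '$')
        {
          const char * start = *p + 1;
          const char * end = strchr (start, '$');
          if (!end)
            return TOK_END;    /* unterminated: a real lexer would
                                  produce an `<error>' here */
          *text = start;
          *len = end - start;
          *p = end + 1;
          return TOK_EQUATION;
        }
      *text = *p;
      *len = strcspn (*p, "$");      /* plain text up to next `$' */
      *p += *len;
      return TOK_TEXT;
    }

    /* Equation lexer, applied recursively to an `<equation>'
       subtext: here `/' is an operator in its own right.  */
    enum token_kind
    equation_lex (const char ** p)
    {
      while (**p == ' ')
        ++*p;
      if (!**p)
        return TOK_END;
      if (**p == '/')
        {
          ++*p;
          return TOK_DIVIDE;
        }
      ++*p;                          /* anything else: one atom per
                                        character, for brevity */
      return TOK_ATOM;
    }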
** Performance
This describes a multi-pass parsing technique whose worst-case
performance is `O(N * L)', where `N' is the length of the input
text and `L' is the number of grammars involved in the syntax
specification: each grammar level scans a given character of the
input at most once. We generally expect `L' to be small and the
worst case to occur rarely.
For many practical applications, therefore, we expect the
performance trade-offs compared to conventional LALR(1) parsing
to be well worth the gains in ease of specification and
excellent error handling.
For wiki applications specifically, we expect this approach to
be a large gain *if* some care is taken in writing lexical
analyzers for the various levels. Today's wikis generally use
ad-hoc, multi-pass approaches making heavy use of dynamically
compiled regexps. We are certainly unlikely to do worse than
that, performance-wise.
* Implementation Strategy
Conveniently, `GNU Bison' provides features for generating
reentrant parsers, for linking more than one parser into a single
compiled program, and for adding parameters to the parser and
lexer APIs.
We intend, ultimately, to invent a new, yacc-like tiny language
for specifying our HFAD grammars and to implement a translator
which generates Bison input from that source.
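For example, the declarations in one sub-grammar's Bison input
might look roughly like this (the directives are standard Bison;
the parameter types are hypothetical):
    %pure-parser
    %name-prefix="equation_"
    %parse-param { hfad_tree ** result }
    %parse-param { hfad_lex_params * params }
    %lex-param   { hfad_lex_params * params }
yielding a reentrant `equation_parse' that can coexist with the
parsers for the other grammars in the hierarchy and that threads
the lexical parameters through to its lexer.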
For now, we intend to rely mostly on "hand-coded" lexers until we
have a better understanding of what kinds of lexers most commonly
arise in practice. While a `lex' (regular expression based) lexer
may be right for a sub-grammar for mathematical equations, we foresee
higher-level lexers needing to make use of context (e.g. indentation
levels) in ways that don't map onto regular expression lexers in
obvious ways.
A two phase approach is planned for building the parsing tools:
Phase 1: Devise a moderately interesting Wiki-style syntax
for static text to be translated to, at least, HTML and TeX.
Implement a recursive decomposition parser "by hand", writing
Bison input files directly.
Phase 2: With the lessons learned, design a higher-level language
that more directly captures recursive decomposition parsing and
implement a translator from that syntax to Bison input.
* How to Help
If you have theoretical insights or prior-art comparisons, I'd
like to hear them.
Beyond that, the best way to help the project is to help fund
it. If you want to go off and work on it yourself, please
consider rewarding me for putting the idea forward.
I am `address@hidden' for inquiries and also for PayPal gifts.
-t