RE: Structural regular expressions

emacs-devel

From:	Drew Adams
Subject:	RE: Structural regular expressions
Date:	Fri, 10 Sep 2010 16:50:56 -0700
> the main point of the feature is there are enhanced regexps
> which are aware of the syntax of the buffer contents, so you
> can select comments, strings, scopes, etc.
> 
> Examples for the mentioned blog post:
> 
> V/pattern  select all matches
> V|pattern  select all lines with match
> V{scope    select all matching scopes
> Vatype     select all objects (inclusive)
> Vttype     select all objects (exclusive)
> Y/pattern  select everything but matches
> Y|pattern  select all lines without match
> Y{scope    select everything but scope
> Yatype     select everything but objects (inclusive)
> Yttype     select everything but objects (exclusive)
> 
> And you can perform further selections after the first selection
> recursively, so you can select comments in scopes, etc.
> 
> The document that inspired the above feature of the E editor:
> 
> "The current UNIX® text processing tools are weakened by the
> built-in concept of a line. There is a simple notation that can
> describe the `shape' of files when the typical array-of-lines
> picture is inadequate. That notation is regular
> expressions. Using regular expressions to describe the structure
> in addition to the contents of files has interesting
> applications, and yields elegant methods for dealing with some
> problems the current tools handle clumsily. When operations using
> these expressions are composed, the result is reminiscent of
> shell pipelines."
> 
> http://doc.cat-v.org/bell_labs/structural_regexps/se.pdf

Some more from the paper you cite:

"In these programs, regular expressions are being used to do
 more than just select the input, the way they are used in all
 the traditional UNIX tools.  Instead, the expressions are doing
 a simple parsing (or at least a breaking into lexical tokens)
 of the input. Such expressions are called structural regular
 expressions or just structural expressions."

And: [these programs] "benefit from an additional regular expression to define
the structure of [their] input."

That's the real point, I believe: the paper touts the use of regexps to divide
text into matching chunks, which are not necessarily lines, and then to act on
those chunks in some way.

This is just what Icicles search does.  You can provide an initial regexp that
parses the buffer to define a set of search contexts.  The regexp .* just parses
it into all of its lines.  Regexp \([^\f]*[\f]\|[^\f]+$\) parses it into pages;
\(.+\n\)+ into paragraphs; [A-Z][^.?!]+[.?!] into sentences; and so on.
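
A minimal sketch of that idea outside Emacs, in Python.  The patterns are
assumed Python `re' translations of the regexps above (Emacs and Python regexp
syntax differ), and `contexts' is a hypothetical helper name, not an Icicles
function.

```python
import re

# Assumed Python equivalents of the context-defining regexps above.
LINES = r".*"                      # each line (empty matches are dropped below)
PARAGRAPHS = r"(?:.+\n)+"          # runs of non-empty lines
SENTENCES = r"[A-Z][^.?!]+[.?!]"   # capital letter up to end punctuation


def contexts(pattern, text):
    """Return (start, end, string) for each non-empty match of pattern."""
    return [(m.start(), m.end(), m.group(0))
            for m in re.finditer(pattern, text)
            if m.group(0)]


text = "One two. Three four!\nFive six?\n\nSeven eight.\n"
lines = [s for _, _, s in contexts(LINES, text)]
paragraphs = [s for _, _, s in contexts(PARAGRAPHS, text)]
sentences = [s for _, _, s in contexts(SENTENCES, text)]
```

The same text yields three different sets of search contexts, depending only on
which regexp does the parsing.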

You can provide such a regexp interactively, or define different commands that
encapsulate different context-defining regexps (e.g. search-lines (occur/grep),
search-paragraphs, search-sentences).

In general, a regexp used this way does not necessarily _partition_ the buffer -
there can be areas (gaps) that do not match at all.  Hence the mention by others
of possibly non-contiguous areas ("regions" or a multi-part region).  The regexp
`(concat comint-prompt-regexp "\\S-.*")' selects comint prompt lines, for
instance; and using an imenu generic regexp selects just the definitions
(functions etc.) for the current mode (just their first lines, typically).
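
To make the gaps concrete, here is a small Python sketch.  It assumes a
hypothetical shell transcript whose prompt is "$ " (a stand-in for whatever
`comint-prompt-regexp' matches in a real buffer); `gaps' is an illustrative
helper, not part of any library.

```python
import re


def gaps(pattern, text):
    """Return the (start, end) spans of text that no match of pattern covers."""
    uncovered, pos = [], 0
    for m in re.finditer(pattern, text):
        if m.start() > pos:
            uncovered.append((pos, m.start()))
        pos = max(pos, m.end())
    if pos < len(text):
        uncovered.append((pos, len(text)))
    return uncovered


# Hypothetical transcript: only lines starting with "$ " are prompt lines.
log = "$ ls\nfile.txt\n$ pwd\n/home\n"
prompt_lines = r"(?m)^\$ \S.*"

matched = [m.group(0) for m in re.finditer(prompt_lines, log)]
unmatched = [log[a:b] for a, b in gaps(prompt_lines, log)]
```

Here the command output falls into the gaps: the matches select the prompt
lines only, so they do not partition the text.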

But while a regexp is one handy way to parse a buffer, there is no reason to
limit the idea to using a regexp.  In spite of the fancy name "structural
regexp", _any_ way of dividing the buffer into a set of areas of interest can be
useful in the same way (e.g. as search contexts).  The real argument is that
lines are not the only way to go - grep/occur is not the only search tool (which
is not really news).

And it is misleading to say that regexps "describe the `shape' of files when the
typical array of lines picture is inadequate."  It is not about some file
"shape" or an inherent "structure" of the file content (e.g. code structure).

It is about being able to shape the parts of interest as you want and not always
be limited to lines as parts.  Use any regexp or any other pattern or algorithm
to define the _parts you want_ (e.g. as search contexts).  _You_ define the
shapes of interest.

Can you use regexps to mimic/follow the "shape" of code?  Sure.  But you can
also use them to shape text (including code parts) in other ways.  Generalize
the shaping by regexps, and generalize the tools of shaping beyond just regexps.

And there is not even any need to limit this to areas of a buffer or file.

What this is really about (IMO) is these features:

1. Some way to come up with a set of strings as defined by pairs of buffer
positions.  The strings need not be associated with buffer positions, but that
is the typical case discussed.

2. Some way to filter those strings as a set.

3. Some way to act on the (filtered) strings, individually and perhaps also as a
set.  Search is one such action.
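
The three features can be sketched as three small Python functions.  All names
here (`Context', `define', `keep', `act') are hypothetical, chosen only for
illustration, and a regexp is used for #1 merely because it is the simplest
case.

```python
import re
from typing import Callable, List, NamedTuple


class Context(NamedTuple):
    start: int   # position where the string begins
    end: int     # position just past the string
    text: str    # the string itself


def define(pattern: str, buffer: str) -> List[Context]:
    """#1: come up with a set of strings and their positions."""
    return [Context(m.start(), m.end(), m.group(0))
            for m in re.finditer(pattern, buffer) if m.group(0)]


def keep(contexts: List[Context],
         predicate: Callable[[Context], bool]) -> List[Context]:
    """#2: filter the set."""
    return [c for c in contexts if predicate(c)]


def act(contexts: List[Context], action: Callable[[Context], object]) -> list:
    """#3: act on each remaining string."""
    return [action(c) for c in contexts]


buffer = "alpha\nTODO beta\ngamma\nTODO delta\n"
found = define(r"[^\n]+", buffer)                   # lines as contexts
todo = keep(found, lambda c: "TODO" in c.text)      # narrow the set
starts = act(todo, lambda c: c.start)               # e.g. jump targets
```

Any other parser could replace `define' without touching the other two steps,
as long as it returns strings together with their locations.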

For the "structural regexp" fan, #1 is a regexp.  But a regexp is only one tool
you might use to parse a buffer into such a set.

For Icicles, #1 is often a regexp, but it need not be.  Font-lock provides
another #1.  Font-lock typically uses an ordered combination of regexps, but in
the general case it allows any parsing functions.  There are any number of other
#1's that could prove interesting.  A sophisticated parser can be just as useful
for #1 as is a simple regexp.

As another #1, Icicles can treat bookmarked regions as a search set.  (This
assumes an ability, as in Bookmark+, to bookmark regions: 2 positions, not 1.)
IOW, the strings ("regions") to be searched need not even be in the same buffer
or file.  A tags file could be used similarly, to "parse" a set of source files
into strings that represent function etc. definitions.

All that's needed is some way to define a set of strings and their locations.

For #2, Icicles lets you type an input pattern that filters the set dynamically
(incrementally).  Pattern matching here can use regexps, fuzzy matching,
whatever.  You can "pipe-filter": progressively apply multiple patterns to
narrow the set.  And you can complement the set of matches (complement the
current set wrt the previous filtering).
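
A rough Python sketch of progressive narrowing and complementing.  This only
illustrates the set operations; it is not how Icicles implements them, and the
function names and candidate strings are made up for the example.

```python
import re


def narrow(candidates, pattern):
    """Keep only the candidates that match pattern somewhere."""
    return [c for c in candidates if re.search(pattern, c)]


def complement(previous, current):
    """Candidates that were in the previous set but not in the current one."""
    return [c for c in previous if c not in current]


candidates = ["find-file", "find-tag", "kill-buffer", "find-file-other-window"]
step1 = narrow(candidates, r"find")   # first pattern
step2 = narrow(step1, r"file")        # second pattern, applied to step1 only
rest = complement(step1, step2)       # what the second pattern excluded
```

Each pattern is applied to the result of the previous one, which is what makes
the filtering read like a pipeline.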

For #3, search has been mentioned as an obvious action for individual matches.
Likewise search-and-replace.  (Those are what Icicles search provides by
default.)  But in general any action might be applicable.  

A final comment.  There is nothing earth-shaking about using a regexp in this
way, to define a set of strings/areas to act on.  It hardly merits special
trumpeting.  And in spite of the usefulness of not being _limited_ to a
hard-coded parsing into lines, it is also true that (partly because much in the
way of programming does involve lines) acting on the lines of a file or buffer
or command-line input stream or error log _is_ often useful.