help-bison

Re: Non-greedy wildcard possible? (Long)


From: Laurence Finston
Subject: Re: Non-greedy wildcard possible? (Long)
Date: Thu, 20 May 2004 18:16:05 +0200
User-agent: IMHO/0.98.3+G (Webmail for Roxen)

-------------------
> Laurence Finston wrote:
> 
> > On Wed, 19 May 2004, Frank Heckenbach wrote:
> > I just checked the Flex manual, and at
> > the beginning of Chapter 17, "Performance Considerations", on page 45,
> > it lists "arbitrary trailing context" and "pattern sets that
> > require backing up" as the second and third most expensive
> > "options/actions" that degrade performance.  The "beginning of
> > line operator" is the eighth most expensive.  What I was thinking of
> > was the problem of matching a regexp like
> > ".*^(`a' || `bb' || `ccc' ...)" standing for "any string not ending in
> > `a', `bb', `ccc', etc."  This seems to be what Magnus wants, where the
> > list "a, bb, ccc" is specified by the user and can vary depending on
> > context.  This will be expensive no matter what, not just because of
> > the backing up, but because of all the table lookups.
> 
> Several points:
> 
> - Do you mean `/' instead of `^'? (`^' meaning beginning-of-line can
>   only occur at the start of the pattern, otherwise it means the `^'
>   character literally).
> 

My notation was ad hoc, which is why I explained what I meant.  The way I used
the `^' character was adapted from Emacs' regexps, where `[^fg]' means
"neither f nor g".  I used an ad hoc notation because I don't know whether the
construction I used as an example can be formulated using Emacs' regexps; I
don't think so, but maybe it can.

> 
>   Magnus suggested to use standard regex functions to search for
>   `a|bb|ccc', and return the characters before the match as the
>   "plain text" token. This should work, but is not possible in flex,
>   since flex always matches the patterns from the current position
>   (whereas usual regex functions match them anywhere in the string).
> 

In languages like C, C++, LISP, Pascal, Fortran, and Metafont, input
files usually consist of many syntactically significant items and
relatively few verbatim strings, which are clearly delimited.
Bison would be an appropriate choice for implementing a parser for any of
these languages.
TeX input files and HTML files, on the other hand, generally consist of
lots of ordinary text and some clearly recognizable markup (which need not
be delimited).  In each case, the largest part of the input is normally
handled by a set of default rules and the smallest part is handled in a
special way.  It's worth considering the reasons for this before trying
a different solution.

> 
> Flex performance generally does not depend on the number or
> complexity of the patterns (since DFAs are executed in linear time,
> once constructed which happens when flex runs, not at program run
> time). Only certain problematic features (namely REJECT and
> variable-length trailing context) make matching more complex. The
> other "performance considerations", AFAICS, only add a constant
> overhead per token (such as scanning the buffer for newlines with
> `%option yylineno' in some circumstances).

I don't like to wait, whether it's linear, logarithmic, or exponential
time.  Documents can easily have hundreds of thousands of characters.  If I
have to match strings in the input against 100 "suspicious" tokens, this will
take time.  In addition, since there are no hardwired criteria for determining
the beginning and end of an item of input, it seems quite likely that every
character in the input will have to be tested against the list of "suspicious"
tokens, whether individually or as the character in the nth position of a
string.  It might be possible to reduce the number of comparisons by sorting
the "suspicious" tokens into an appropriate data structure, but this has
associated costs, too, including the necessity of writing code to traverse the
data structure.  If the list of suspicious tokens differs according to
context, it will be necessary to determine which table to use and to switch
tables, and memory will be required to store all of this information.

I don't think all of this work is justified, since in most cases the result
will be "it's not markup, output it verbatim."  

Laurence


