Re: Unquoted special characters in regexps


From: martin rudalics
Subject: Re: Unquoted special characters in regexps
Date: Tue, 28 Feb 2006 11:27:01 +0100
User-agent: Mozilla Thunderbird 1.0 (Windows/20041206)

> `]', like `-', is only special in the context of a character
> alternative, that is, if you are already inside a character
> alternative before you type it.  By contrast, `[' and all other
> special characters (except `^') are only special outside that context.

You can talk about a context iff you are able to grammatically specify
it.  In order to talk about the contents of a string you must be able to
determine the character sequences opening and closing strings.  It would
be strange to say, for example, that the double-quote opening an Elisp
string is outside the context of the string and the double-quote that
closes it inside.  It would be strange to say that the bracket opening a
character alternative is outside the context of the alternative and the
closing bracket inside.

> All characters that are special outside character alternatives are
> never special if you precede them with a backslash.  This is true even
> for `^'.  This is why it is good to precede them with a backslash even
> if they are not special.  That way, the reader can see that they are
> not special, without studying the regexp.

I agree.  Let's try to read the following definition from `cc-fonts.el':

(defconst autodoc-font-lock-doc-comments
  `(("@\\(\\w+{\\|\\[\\(address@hidden|@@\\)*\\]\\|address@hidden|$\\)"
 ...

It tells me that there are two character alternatives started by an
unquoted `[' and terminated by an unquoted `]'.  It also tells me that
it's meant to match a bracketed expression as represented by `\\[' and
`\\]' - I quickly exclude the possibility that the backslashes preceding
any of these brackets are quoted backslashes in a character alternative.
And, finally, the expression tells me that the author was probably
uncertain about how to put a `]' inside a complemented character
alternative, hence (s)he quoted it with a single backslash.  In any case
I have no difficulty reading the expression, although I am completely
ignorant of its purpose.  You propose to write

(defconst autodoc-font-lock-doc-comments
  `(("@\\(\\w+{\\|\\[\\(address@hidden|@@\\)*]\\|address@hidden|$\\)"
 ...

instead.  In that case, when I look at the character sequence `*]' I
would have to consider the case that the `]' closes some character
alternative.  Only after resolving that would I be able to say that the
`]' should indeed match a right bracket.  And I would still have to
check whether the backslashes preceding the `\\[' are quoted backslashes
in a character set.
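
To make the two spellings concrete, here is a small sketch one could
evaluate in *scratch*.  The test strings are made up for illustration and
are not taken from `cc-fonts.el'; the anchors \` and \' only force a
match against the whole string.

```elisp
;; Outside a character alternative `]' is an ordinary character, so the
;; regexps  x*]  and  x*\]  match the same text.
(string-match "\\`x*]\\'" "xx]")       ; => 0
(string-match "\\`x*\\]\\'" "xx]")     ; => 0

;; `[', by contrast, has to be quoted to match a literal left bracket.
(string-match "\\`\\[x*]\\'" "[xx]")   ; => 0
```

Both forms behave identically for the matcher; the difference discussed
here is only in how easily a human reader can parse them.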

> First of all, there are (surprisingly) many occurrences of "\\]" in
> the Emacs source, where the `]' _is_ special and closes a character
> alternative that contains a backslash.  Reportedly quoting a `]' with
> a backslash _inside_ a character alternative works in some other regexp
> implementations such as AWK.  So if I see "\\]" I have to worry about
> three possibilities:  it might deliberately close a character
> alternative which includes a backslash, it might do so by accident because
> the author tried to quote a `]' inside a character alternative (and
> hence the regexp is buggy), or it might be a deliberately quoted `]'
> outside a character alternative.

The Emacs manual clearly states that the backslash is not special in a
character set.  But I admit that users of other languages do have
problems when writing Elisp regexps.  That's why a clear and unambiguous
definition of these concepts is important.
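
A short sketch of that rule, again with made-up test strings: since the
backslash is an ordinary character inside a character alternative, it
cannot be used there to quote a `]'.

```elisp
;; The Lisp string "[^\\]]" is the regexp  [^\]]  : the alternative [^\]
;; (any character except a backslash) followed by a literal `]'.
(string-match "\\`[^\\]]\\'" "a]")     ; => 0
(string-match "\\`[^\\]]\\'" "a")      ; => nil, the trailing `]' is missing

;; To include `]' in an alternative, put it first (after `^', if any).
(string-match "\\`[^]@]\\'" "a")       ; => 0
(string-match "\\`[^]@]\\'" "]")       ; => nil
```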

> If I see `]' without preceding "\\", I only have to worry about
> whether or not it closes a character alternative, and not about the
> third possibility of a bug.

When I try to read a regular expression I do not worry about the
possibility of a bug in the first place.  I try to understand what the
author wanted to match.

> There are places in the Emacs code that quote a `]' outside a
> character alternative.  Even if we decide that this is undesirable, I
> do not fancy finding and changing them all.  But we could change the
> behavior of `regexp-quote' and `regexp-opt' which currently quote
> such `]'.  That could be done with the following trivial patch, which
> I could install if that is what we decide to do:

Given the number of regular expressions that users have created with
these functions and manually inserted into their code, that would be
confusing indeed.
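
For what it is worth, the matching behavior itself would not be affected.
The following sketch (the input string is a made-up example) should give
the same results whether or not `regexp-quote' puts a backslash before
`]'; only the string it returns, which may differ between Emacs versions,
would look different to someone who pasted it into their code.

```elisp
;; Build an anchored regexp from a literal string containing brackets.
(let ((re (concat "\\`" (regexp-quote "a[1]") "\\'")))
  (list (string-match re "a[1]")    ; the literal text matches at 0
        (string-match re "a1")))    ; nil: the brackets are required
```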





