Re: special characters in filenames in error messages
From: Bruno Haible
Subject: Re: special characters in filenames in error messages
Date: Sun, 14 Dec 2008 21:58:00 +0100
User-agent: KMail/1.9.9
Karl Berry wrote:
> What I don't understand with your proposal is how this magical url
> vs. file bit is known.
When producing/printing an error message, this information has to come
from the program. The program certainly knows whether it has used a function
like fopen() or open_url() to access the contents it is complaining about.
When parsing an error message, such as in Emacs' 'grep' mode, then - unless
the program is known to inspect only local files or only URLs - a heuristic
is indeed required:
> For example, if someone runs Henri's validator
> in Emacs, it seems to me that next-error is going to have heuristically
> guess whether it is a url to know how to interpret %'s.
Yes, it has to heuristically guess that. This heuristic is well-known:
it happens any time a user enters a file name into a browser's URL field.
For example, KDE 3 konqueror implements this heuristic by checking whether
a file with that name exists in the current directory:
$ konqueror 'file:/foo.txt'
tests whether the current directory has a subdirectory 'file:' that
contains a file 'foo.txt'. If yes, this file is opened. Otherwise the string is
interpreted as an URL and canonicalized to "file:///foo.txt".
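A minimal sketch of such an existence-based heuristic, in Python. The function name classify_location and the cwd parameter are hypothetical; this only mirrors the behaviour described above and is not konqueror's actual code.

```python
import os


def classify_location(text, cwd="."):
    """Guess whether `text` is a local file name or a URL.

    Hypothetical helper: if a file with exactly this name exists below
    `cwd`, treat the string as a file name; otherwise treat it as a URL.
    """
    candidate = os.path.join(cwd, text)
    if os.path.exists(candidate):
        return "file"
    return "url"
```

With this sketch, 'file:/foo.txt' is classified as a file only if a directory named 'file:' containing 'foo.txt' exists below cwd, which shows that the result depends on the state of the file system.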
If you find that the need for a heuristic is a problem, then the solution that
comes to mind is to use different markers to distinguish the two cases: Use
"..." to enclose a file name, but <...> to enclose an URL. Then the parsing is
unambiguous. (Marking an URL with '<' and '>' is common practice, see RFC 2396
section 2.4.3.) I'm still in favour of using the same escape syntax for the
file names, namely '%nn', so that end users see only one escape syntax.
This heuristic-avoiding proposal looks like this:
==============================================================================
Proposal B:
- For output of a filename in an error message:
    The escaped syntax is required if the filename contains a ':' or
    newline, or starts with a '"' or '<'. It may also be used for other
    filenames.
    In escaped syntax, a '"' is output, then the filename is output, with
    escaping:
      - Occurrences of '"' and '%' and newline are escaped as
        %22 and %25 and %0A, respectively,
      - Other US-ASCII characters may be escaped in %nn syntax as well,
        where nn is the hexadecimal notation (case insignificant)
        of the byte value in the US-ASCII encoding.
    Finally a '"' is output.
    Otherwise, the filename is output literally, without modifications.
- For output of an URL in an error message:
    Remember that an URL or URI always has %nn escaping already enabled
    (see RFC 2396, section 2.4.2) and therefore does not contain the
    characters '"', '<', '>' (see RFC 2396, section 2.4.3).
    A '<' is output, then the URL or URI is output literally, then a '>'
    is output.
- For parsing:
    - If the first character is a '<', then it's an URL or URI. It
      ends at the next '>' character. The URL or URI is the substring
      from the leading '<' (exclusive) to the next '>' (exclusive).
    - If the first character is a '"', then it's a filename in escaped
      syntax. It ends at the next '"' character. Unescaping converts
      each %nn to the US-ASCII character with byte value nn.
    - Otherwise, it's a filename, and the filename ends at the first ':'
      or end of line.
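The output rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, filenames are handled as Python strings, and only the three mandatory characters are escaped (the optional escaping of other US-ASCII characters is not used).

```python
def needs_escaped_syntax(filename):
    """True if the filename must be written in the quoted, escaped form."""
    return (":" in filename
            or "\n" in filename
            or filename.startswith(('"', "<")))


def format_filename(filename):
    """Format a filename for an error message (output rules for filenames)."""
    if not needs_escaped_syntax(filename):
        # Common case: the filename is output literally.
        return filename
    # Escape '%' first, so the '%' of the later escapes is not escaped again.
    escaped = (filename.replace("%", "%25")
                       .replace('"', "%22")
                       .replace("\n", "%0A"))
    return '"' + escaped + '"'


def format_url(url):
    """Format a URL or URI, assumed to be %nn-escaped already."""
    return "<" + url + ">"
```

For example, format_filename("/f/00/bar") returns the name unchanged, while format_filename("a:b") returns the quoted form "a:b" including the double quotes.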
Properties of this proposal:
- The proposal handles both filenames and URLs or URIs.
- The user only sees one escape syntax, namely the %nn notation defined
by RFC 2396. (In other proposals, different escape syntaxes were used
for filenames and for URLs.)
- The user is already familiar with the '<...>' notation for URLs.
- The user can copy&paste URLs from the output in all cases. (Just the
part inside '<...>'.)
- In most cases, filenames are output literally and can therefore be
copy&pasted by the user. This holds both for filenames on Unix
("/f/00/bar" syntax) and filenames on Windows ("c:\f\00\bar" syntax).
- The output does not contain newlines; therefore a program that parses
the output can proceed by reading line by line.
- The output can be parsed without prior knowledge of whether a
location is a filename or an URL. It's explicit.
- The output can be parsed without reference to a particular encoding
for non-ASCII characters: Unescaping is limited to US-ASCII characters
inside filenames.
==============================================================================
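The parsing rules of Proposal B can be sketched in Python as follows. The names are hypothetical, and two details are assumptions that the proposal does not spell out: a %nn sequence with nn above 7F is left untouched, and an unterminated '<' or '"' raises an error.

```python
import re

_PERCENT_NN = re.compile(r"%([0-9A-Fa-f]{2})")


def unescape(text):
    """Convert each %nn to the US-ASCII character with byte value nn."""
    def replace(match):
        value = int(match.group(1), 16)
        # Assumption: values outside US-ASCII are left as they are.
        return chr(value) if value < 0x80 else match.group(0)
    return _PERCENT_NN.sub(replace, text)


def parse_location(line):
    """Split one line (without its newline) into (kind, location, rest)."""
    if line.startswith("<"):
        end = line.find(">", 1)
        if end == -1:
            raise ValueError("unterminated '<'")
        return ("url", line[1:end], line[end + 1:])   # no unescaping
    if line.startswith('"'):
        end = line.find('"', 1)
        if end == -1:
            raise ValueError("unterminated '\"'")
        return ("file", unescape(line[1:end]), line[end + 1:])
    end = line.find(":")
    if end == -1:
        return ("file", line, "")                     # ends at end of line
    return ("file", line[:end], line[end:])
```

No heuristic is involved: the first character alone decides whether the location is an URL, an escaped filename, or a literal filename.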
==============================================================================
Proposal A:
- For output of a filename in an error message:
    The escaped syntax is required if the filename contains a ':' or
    newline, or starts with a '"' or '<'. It may also be used for other
    filenames.
    In escaped syntax, a '"' is output, then the filename is output, with
    escaping:
      - Occurrences of '"' and '%' and newline are escaped as
        %22 and %25 and %0A, respectively,
      - Other US-ASCII characters may be escaped in %nn syntax as well,
        where nn is the hexadecimal notation (case insignificant)
        of the byte value in the US-ASCII encoding.
    Finally a '"' is output.
    Otherwise, the filename is output literally, without modifications.
- For output of an URL in an error message:
    Remember that an URL or URI always has %nn escaping already enabled
    (see RFC 2396, section 2.4.2) and therefore does not contain the
    characters '"', '<', '>' (see RFC 2396, section 2.4.3).
    A '"' is output, then the URL or URI is output literally, then a '"'
    is output.
- For parsing:
    - If the first character is a '"', then it's a filename or URL in
      escaped syntax. It ends at the next '"' character. Some heuristic
      is used to distinguish between filename (with unescaping) and URL
      (no unescaping performed). For a filename, unescaping converts
      each %nn to the US-ASCII character with byte value nn.
    - Otherwise, it's a filename, and the filename ends at the first ':'
      or end of line.
Properties of this proposal:
- The proposal handles both filenames and URLs or URIs.
- The user only sees one escape syntax, namely the %nn notation defined
by RFC 2396. (In other proposals, different escape syntaxes were used
for filenames and for URLs.)
- The user can copy&paste URLs from the output in all cases. (Just the
part inside '"..."'.)
- In most cases, filenames are output literally and can therefore be
copy&pasted by the user. This holds both for filenames on Unix
("/f/00/bar" syntax) and filenames on Windows ("c:\f\00\bar" syntax).
- The output does not contain newlines; therefore a program that parses
the output can proceed by reading line by line.
- The output can be parsed without reference to a particular encoding
for non-ASCII characters: Unescaping is limited to US-ASCII characters
inside filenames.
==============================================================================
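For comparison, a sketch of a Proposal A parser in Python. The proposal leaves the heuristic open, so the one used here is purely an assumption for illustration: a quoted string that starts with a scheme of at least two characters followed by ':' is taken to be an URL. All names are hypothetical.

```python
import re

_PERCENT_NN = re.compile(r"%([0-9A-Fa-f]{2})")
# Assumed heuristic: a scheme of two or more characters, then ':'.
# Two characters are required so that "c:\..." is not taken for a URL.
_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]+:")


def _unescape(text):
    def replace(match):
        value = int(match.group(1), 16)
        return chr(value) if value < 0x80 else match.group(0)
    return _PERCENT_NN.sub(replace, text)


def parse_location_a(line):
    """Split one line (without its newline) into (kind, location, rest)."""
    if line.startswith('"'):
        end = line.find('"', 1)
        if end == -1:
            raise ValueError("unterminated '\"'")
        body, rest = line[1:end], line[end + 1:]
        if _SCHEME.match(body):
            return ("url", body, rest)            # guessed: no unescaping
        return ("file", _unescape(body), rest)    # guessed: unescape
    end = line.find(":")
    if end == -1:
        return ("file", line, "")
    return ("file", line[:end], line[end:])
```

The guess can go wrong: a file named 'notes:2008%.txt' is output as "notes:2008%25.txt", and this heuristic then classifies it as an URL and skips the unescaping. Any other heuristic has similar corner cases.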
> Anyway, this decision seems like a judgement to me, not something that
> can be considered provably better.
Avoiding the heuristic is undoubtedly an advantage. I prefer proposal B over A
myself. The use of '<' and '>' as delimiters for an URL also matches the
convention used in email and elsewhere on the web.
> And rms is the one whose judgement counts. I'll write him about it unless
> you want to do so.
You're welcome to write to him. Either both proposals A and B, or only B, or
others, as you like. I think he will need the "Properties" sections of the
proposals - it's not obvious to someone who has not participated in this
discussion which proposal has which properties.
Bruno