Re: format bug

bug-m4

From:	Daniel Richard G.
Subject:	Re: format bug
Date:	Fri, 1 Jun 2007 13:02:22 -0400
On Thu, 2007 May 31 21:05:04 -0600, Eric Blake wrote:
> 
> POSIX standardized existing practice.  printf(1) was invented years ago,
> in the Ninth Edition system (before I was even using Unix), and parsed the
> argument for %b, %c, and %s as a string, and for all other specifiers as
> integers (the original printf(1) did not support %e, %f, or %g; although
> most modern implementations of printf(1) support that as an extension).

All very good, but the same provenance gave us a df(1) that reports disk 
sizes in 512-byte blocks. Unix history and POSIX are venerable, but let's 
not see them for more than what they are....

> That's the point of this thread - I'm offering to implement escape
> sequences (also part of printf(1)) to make up for the fact that I am about
> to change %c semantics to be consistent with other, more standardized,
> utilities.

I wasn't thinking this was a quid pro quo, but okay.

The thing is, you're going to end up with a %c specifier that is 
consistent, straightforward, and nigh-on useless. It'll basically be an 
easier-to-remember alias for "%.1s". At that rate, you may as well just 
remove it altogether.
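
To make the "%.1s" comparison concrete, here is a minimal sketch in Python rather than m4. The function names are hypothetical, and `percent_c_as_string` only models the proposed string-style `%c`; it is not taken from the m4 sources.

```python
# Sketch only: models the *proposed* string-style %c, not real m4 code.

def percent_c_as_string(arg: str) -> str:
    """Hypothetical %c that treats its argument as a string
    and emits only the first character (nothing for an empty string)."""
    return arg[:1]


def percent_dot1_s(arg: str) -> str:
    """Ordinary %.1s: print at most one character of the string."""
    return "%.1s" % arg


# Under this model the two specifiers agree on every input tried here.
for sample in ["1", "abc", ""]:
    assert percent_c_as_string(sample) == percent_dot1_s(sample)
```

If that model is right, the only thing `%c` adds over `%.1s` is a shorter spelling.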

> The problem is that with the shell (printf) and with m4 (format), all
> arguments start out as strings, with no strong typing to tell whether the
> string should be interpreted as a number or left as a string.  So seeing
> the string "1" makes it impossible to tell whether the user meant the
> character with value 1 (C's '\1') or the literal 1 (C's '1').  And I tend
> to value consistency (it's easier to state that m4 is like the shell, than
> it is to say that m4 is a special case and behaves differently), as long
> as there is also a way to accomplish the alternate interpretation.

Expecting format(%c,1) to give a literal '1' strikes me more as an issue of 
the user not understanding what %c does, rather than an ambiguity problem. 
Just as if that very user tried format(%d,A) and expected that to give 
anything sensible.

But okay, if the ambiguity is a problem, why not take a different approach? 
printf(1) is nice, but it's not so big and important in the Unix pantheon 
to be the only point of reference. (I worked on Unix for many years without 
knowing about it; all the scripts I saw would use e.g. "/bin/echo -e" or 
"perl -pe" to do fancy stuff beyond builtin-echo's abilities.)

Some ideas, off the top of my head:

        A. Double-quoting, to simulate a C character literal
           format(`%c', `1')   => \001
           format(`%c', ``1'') => 1

        B. "Character literal" flag character (using ':' here)
           format(`%c', `1')  => \001
           format(`%c', `:1') => 1

        C. New format modifier (using '.' here)
           format(`%c', `1') => \001
           format(`%.c', `1') => 1

        ...
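
As an illustration of idea B only, the dispatch could look roughly like the Python sketch below. `format_c` is a hypothetical name, the ':' flag is just the placeholder used in the list above, and none of this reflects how m4 is actually implemented.

```python
def format_c(arg: str) -> str:
    """Hypothetical %c handler following idea B (':' marks a literal)."""
    if arg.startswith(":"):
        # Flagged argument: take the character after ':' literally.
        return arg[1:2]
    # Unflagged argument: interpret it as a decimal character code.
    return chr(int(arg))


print(repr(format_c("1")))   # '\x01'
print(repr(format_c(":1")))  # '1'
```

The point of the flag is that the caller, not the formatter, states which interpretation is wanted, so the string "1" is no longer ambiguous.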

> And it is also why I will be implementing \nnn octal and \xnn (required 
> by POSIX in printf), but was not planning on \unnnn or \Unnnnnnnn at this 
> point (also not required by POSIX), because implementation-defined 
> multibyte characters don't make sense until m4 can do multibyte 
> characters to begin with.

What do you mean by m4 "doing" multi-byte characters? Rudimentary UTF-8 
support is a freebie; proper support would amount to making sure regexes 
and such know about the character semantics.

I would actually consider high-character generation to be outside the scope 
of "multi-byte support" proper, however. It's directly analogous to the 
user feeding UTF-8 text into (MB-unaware) m4: it's possible for the same 
reasons, and it has the same potential pitfalls.
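
A small Python sketch of that pitfall, assuming UTF-8 as the encoding (the choice of the euro sign, U+20AC, is arbitrary): one character becomes several bytes, and a tool that counts or truncates by bytes can split it.

```python
euro = chr(0x20AC)              # one character: the euro sign, U+20AC
encoded = euro.encode("utf-8")  # what a byte-oriented tool actually sees

print(len(euro))     # 1 character
print(len(encoded))  # 3 bytes
print(encoded)       # b'\xe2\x82\xac'

# A byte-wise truncation (what "%.1s" amounts to in an MB-unaware tool)
# keeps only the lead byte, which is not valid UTF-8 on its own.
truncated = encoded[:1]
try:
    truncated.decode("utf-8")
    truncated_is_valid = True
except UnicodeDecodeError:
    truncated_is_valid = False

print(truncated_is_valid)  # False
```

Passing the three bytes through untouched works fine; trouble starts only when something measures or cuts the string in byte units.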

> The fact that %lc ever worked for you is an undocumented happenstance and
> a sign of non-portability; it did not exist in printf(3) when Rene' first
> implemented m4 back in 1990, so it was not part of the original design of
> GNU m4.  In reality, m4 has never been locale-aware, and more platforms
> probably got %lc wrong than those that seemed to get it right.  I can
> consider reenabling it, now that you have brought it up, but since m4
> still operates on bytes and not multibyte characters, I'm not sure it is
> the right thing to knowingly enable something that is likely broken.  In
> general, using undocumented aspects of a program is subject to these sorts
> of changes in behavior.

%lc was a nifty, if unintended, feature; I'm all for m4 at least not 
blocking access to it if the underlying system supports it.

As for broken systems, I think it would be enough to say, "If your C 
library can't do MB properly then neither will m4." If worse comes to 
worst, the configure script could check for this and disable MB support 
altogether if it's not all there.

> > (I've been using a chr() composite implemented with %lc, which I thought 
> > was extremely cool on finding that m4 supported it. Now I'm a bit miffed 
> > that the latest CVS code no longer recognizes this...)
> 
> Well, since you are the first to bring it up, maybe we can consider
> documenting it, and making it something we support (and adding regression
> tests, to make sure we don't inadvertently break it again in the future).

I think it's a useful thing to have, so I'd definitely like to see this 
happen. Documentation would help---I came across %lc only in studying the 
man page for printf(3).

> Meanwhile, you've made enough comments about how m4 behaves that you may
> want to consider assigning copyright to the FSF so that you can
> contribute patches to help move it along to better meet your needs.  I
> tend to work on the things that bother me, and I have not yet been
> bothered by the inability of using multibyte characters in m4 to the point
> to make the code changes necessary to support locales properly.

I have some other code simmering that I'd eventually like to contribute, so 
this sounds like a good step. Where do I go?

> > May I suggest borrowing a page from Perl, and adding a builtin like ord()? 
> > That would be much more efficient than a lookup table, and it would be able 
> > to handle multi-byte characters.
> 
> No, we don't need another builtin.  For 1.4.9 and earlier it is too late,

FYI: Everything I say concerns 2.0. I'm using a TeX-alike syntax, and 1.4 
doesn't have changesyntax(), so the branch is completely off my radar.

> we can't re-release those versions with a new builtin, so you are already
> stuck with implementing it as a hairy composite.  And with 1.4.10 and 2.0,
> format(`%d',`"a') is the quick and easy way to do this, so you can easily
> write an ord() composite that wraps format.

Whoa, hold the phone---is this in CVS yet?

        format(`%d',`"a')
        => 0

This is interesting... a way to have a literal character interpreted in 
terms of its integer codepoint. Do I see here the glimmer of a solution to 
a different, but closely related problem? :-)
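
For reference, the leading-quote convention comes from POSIX printf(1): when a numeric argument starts with a single or double quote, its value is the numeric value of the character that follows. The Python sketch below models that rule only; `numeric_arg` and `ord_composite` are hypothetical names, and the code says nothing about what any particular m4 build does.

```python
def numeric_arg(arg: str) -> int:
    """Parse a %d-style argument, modelled on the POSIX printf(1) rule:
    a leading ' or " means 'use the code of the next character'."""
    if len(arg) >= 2 and arg[0] in "\"'":
        return ord(arg[1])
    # Otherwise treat the argument as a plain decimal integer.
    return int(arg)


def ord_composite(ch: str) -> int:
    """Hypothetical ord() wrapper built on the quote convention."""
    return numeric_arg('"' + ch)


print(numeric_arg('"a'))   # 97
print(numeric_arg("42"))   # 42
print(ord_composite("A"))  # 65
```

Under that rule the expected result for `"a` is 97 rather than 0, which is what makes a one-line ord() wrapper possible.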

> My comment was that there is no simple way to do this portably across 
> both 1.4.9 and 1.4.10 simultaneously, so I hope to code up some sort of 
> composite that can do the trick in 1.4.9 and add it to the examples 
> directory.

Let me guess---a table-lookup routine. Oy!

(m4 is nice, but sometimes it can feel like programming a raw Turing 
machine...)


--Daniel


-- 
NAME   = Daniel Richard G.       ##  Remember, skunks       _\|/_  meef?
EMAIL1 = address@hidden        ##  don't smell bad---    (/o|o\) /
EMAIL2 = address@hidden      ##  it's the people who   < (^),>
WWW    = http://www.******.org/  ##  annoy them that do!    /   \
--
(****** = site not yet online)