Re: Comparison against backslash obtained via .substring
From: onf
Subject: Re: Comparison against backslash obtained via .substring
Date: Sun, 03 Nov 2024 08:50:03 +0100
Hi Branden,
thank you for having patience with me :)
I have joined all the replies here.
On Sun Nov 3, 2024 at 4:36 AM CET, G. Branden Robinson wrote:
> At 2024-11-03T03:56:23+0100, onf wrote:
> > Ugh, I should have taken more time to reply -- I missed the fact that
> > groff doesn't consider DEL a control character. Thanks for the hack,
> > it works...
>
> Glad to hear it! Let me put on a familiar hat and suggest that you use
> different terminology, though.
>
> In *roff, a "control character" is something that the formatter
> recognizes as starting a "control line". Using the same term to refer
> to properties of characters from the encoding your system uses can lead
> to confusion. [...]
Agreed. I am aware of troff's concept of a control character; I just
hadn't realized it goes by exactly the same name as the concept used by
ASCII, POSIX ERE character classes, ECMA-48 (iirc), and so on.
> [...]
>
> The reason DEL works is not because it is or isn't a control character,
> but because it's a valid input character, like ^B, ^C, and several
> others. (Historically, ^G was popular in attempts to avoid the problem
> in the next paragraph.)
> [...]
I tried using a "control" (non-printing) character because I could be
sure it would not occur in input, not because I thought groff treats it
specially.
On Sun Nov 3, 2024 at 3:46 AM CET, G. Branden Robinson wrote:
> At 2024-11-03T02:53:14+0100, onf wrote:
> > changing the escape character hasn't occurred to me, that's clever!
> > Unfortunately it doesn't work -- groff won't allow me to set the
> > escape character to a control one,
>
> It does, but it has to be a valid input character.
>
> groff(7):
> On a machine using the ISO 646, 8859, or 10646 character
> encodings, invalid input characters are 0x00, 0x08, 0x0B,
> 0x0D–0x1F, and 0x80–0x9F. On an EBCDIC host, they are 0x00–0x01,
> 0x08, 0x09, 0x0B, 0x0D–0x14, 0x17–0x1F, and 0x30–0x3F. Some of
> these code points are used by troff internally, making it non‐
> trivial to extend the program to accept UTF‐8 or other encodings
> that use characters from these ranges.
> [...]
Thank you for pointing to that. I now remember that I had already seen
this, but had completely forgotten about it since. It seems I just
picked the wrong non-printing characters.
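For my own reference, the ISO 646/8859/10646 list from the groff(7)
excerpt above can be written down as a small check. This is only a
Python sketch, not groff code: the names INVALID_RANGES and
is_valid_groff_input are made up for illustration, and limiting the
check to 8-bit values is my assumption.

```python
# Invalid input code points on ISO 646/8859/10646 hosts, transcribed from
# the groff(7) excerpt quoted above. The names here are hypothetical.
INVALID_RANGES = [
    (0x00, 0x00),
    (0x08, 0x08),
    (0x0B, 0x0B),
    (0x0D, 0x1F),
    (0x80, 0x9F),
]


def is_valid_groff_input(code: int) -> bool:
    """Return True if the 8-bit code point is absent from the quoted list."""
    if not 0 <= code <= 0xFF:
        # Assumption: only single-byte input values are considered.
        raise ValueError("expected an 8-bit code point")
    return not any(lo <= code <= hi for lo, hi in INVALID_RANGES)


if __name__ == "__main__":
    candidates = [("^B", 0x02), ("^G", 0x07), ("BS", 0x08),
                  ("ESC", 0x1B), ("DEL", 0x7F)]
    for name, code in candidates:
        verdict = "valid" if is_valid_groff_input(code) else "invalid"
        print(f"{name} (0x{code:02X}): {verdict}")
```

By that list, ^B, ^G, and DEL are accepted, while BS and ESC are not,
which matches what I ran into.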
On Sun Nov 3, 2024 at 4:19 AM CET, G. Branden Robinson wrote:
> At 2024-11-03T03:25:01+0100, onf wrote:
> > [...] Adding a string iterator would fix this, although it would make
> > my code significantly more complex as I would have to compare the
> > strings character by character [...]
>
> One reason to have the string(/macro/diversion) iterator request is that
> as soon as we do, we can use it to construct a "string library" macro
> package. "string.tmac" seems like a likely name.
>
> What I envision is removing several of the string-handling requests from
> GNU troff and replacing them with macros in "string.tmac". [...]
> "string.tmac" would also be a useful place to experiment with things
> like:
> .strchr
> .strrchr
> .index
> .rindex
> .slice (return a substring using Python-esque indexing)
> And maybe AWK-like replacement macros:
> .sub
> .gsub
> [...]
Yup, I hadn't realized the possibilities of such an approach. Seems
like a great idea! I would suggest replacing strchr with strpbrk though
-- it's useful to be able to look for more than a single character.
Speaking of which, having Unicode-capable ctype macros would be quite
helpful too. (I have recently written code that would really benefit
from an ispunct macro.)
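To make it concrete what I mean by these two, here is a rough Python
sketch of the semantics I have in mind. It is not a proposal for the
macro interface: the function names just mirror the C ones, strpbrk
returns an index rather than a pointer, and treating "punctuation" as
the Unicode P* general categories is my assumption (C's ispunct also
counts ASCII symbols such as + and $, which this does not).

```python
import unicodedata


def strpbrk(s: str, accept: str) -> int:
    """Return the index of the first character of s found in accept, or -1."""
    for i, ch in enumerate(s):
        if ch in accept:
            return i
    return -1


def ispunct(ch: str) -> bool:
    """Return True if ch is one character in a Unicode punctuation category."""
    # Assumption: "punctuation" means general categories Pc, Pd, Ps, Pe,
    # Pi, Pf, and Po; symbol categories (S*) are deliberately excluded.
    return len(ch) == 1 and unicodedata.category(ch).startswith("P")


if __name__ == "__main__":
    # Look for either a backslash or an apostrophe in one pass.
    print(strpbrk("a\\b'c", "'\\"))
    print(ispunct("!"), ispunct("\u00ab"), ispunct("a"))
```

With strchr alone, the first call would take two separate searches, one
per character.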
> > A problem with this solution is that it's incomplete. It addresses a
> > particular issue arising from troff's usage of macro substitution,
> > but doesn't solve the others. For instance, I would still run into
> > issues if I tried to compare a literal ' against anything and
> > delimited the comparands by the same character, which can happen with
> > the proposed iterator mechanism:
> > .ie '\\*[ch]'"' \" ...
>
> We can't verify or refute that claim until the code is in place, but I
> expect that you are wrong about this, unless you run the formatter in
> AT&T compatibility mode (in which case the syntax `\*[ch]` won't work
> anyway).
>
> info '(groff) Compatibility Mode':
>
> Normally, GNU 'troff' preserves the interpolation depth in
> delimited arguments, but not in compatibility mode.
>
> .ds xx '
> \w'abc\*(xxdef'
> => 168 (normal mode on a terminal device)
> => 72def' (compatibility mode on a terminal device)
>
> > $ groff -b -ww -z
> > .ds str "'\"
> > .ie '\*[str]'\'' .tm groff: single quote
> > .el .tm groff: else
> > groff: else
>
> This fails because `\` does not escape the apostrophe the way you think
> it does. `\'` is a special character escape sequence.
> [...]
> You'll need to do this a slightly different way when attempting to match
> a character that happens to be the same as the delimiter in a formatted
> output comparison. One layer of indirection will do.
> [...]
Thanks for taking the time to explain this. I think I had somewhat
assumed it works this way, but the fact that substituting in the escape
character from a string (as in my first message) is capable of escaping
the comparand delimiter really confused me. I still don't get how that's
possible if groff preserves the interpolation depth as you say...
I will be happy to be wrong about any possible issues, though :)
~ onf