From: G. Branden Robinson
Subject: Proposed: change `pm` request argument semantics (was: process man(7) (or any other package of macros) without typesetting)
Date: Thu, 17 Aug 2023 18:44:14 -0500
Hi Alex,
At 2023-08-17T21:12:35+0200, Alejandro Colomar wrote:
> I've had this desire for a long time, and maybe now I have a strong
> reason to ask for it.
[...]
> The problem is that at no point you can have the .roff source, after
> the man(7) macros have been expanded. Would it be possible to split
> the groff(1) pipeline to have one more preprocessor, let's call it
> woman(1) (because man(1) is already taken), so that it translates
> man(7) to roff(7)?
In other words, you want to see what a *roff document looks like after
all macro expansions have been (recursively) performed.
I wanted this, too, back in 2017 when I first started working on groff.
The short answer is "no".
The longer answer is that this is hard because GNU troff, like AT&T
troff, never builds a complete syntax tree for the document the way
"modern" document formatters do. nroff and troff were written and
deployed on DEC PDP-11 machines, whose resources would today be
considered those of an embedded microcontroller. Therefore they handled as little input at
one time as possible. Roughly, this meant that input was collected,
macro-expanded as soon as it was seen, and then as soon as it was time
to break an output line, a lot of formatter state related to parsing was
flushed, and it started reading input again.
Understanding *roff a little better 6 years later, I can more easily
imagine ways to run AT&T troff out of memory on a PDP-11. Ultra-long
diversions would be one way,[1] because formatted diversion contents
have to be kept in memory until they're called for. A multiplicity of
moderately sized diversions would do it too. Conditional blocks would
be another problem. When encountering a brace escape sequence \{, the
formatter has to scan ahead in the input. Or at least GNU troff does.
Maybe AT&T troff did something clever, but its source code is famously
opaque.
I'll say it before Ingo does: mandoc(1) (as I understand it) _does_
build a syntax tree for the entire document before producing output,
which enables some of the nice features that it has.
I see Lennart has replied with some further exploration of the
challenges here. Rather than duplicate his comments, let me move on to
something vaguely related but, I hope, potentially useful.
Can we do something that might help without re-architecting GNU troff?
I think we can. I've been mulling this for months, and now that I'm on
the threshold of implementing a `for` request as a string iterator,[2]
I think I want something else first, largely to help me test it.
I want a string/macro/diversion dumper.
groff(7):
.pm Report, to the standard error stream, the names and
sizes in bytes of defined macros, strings, and
diversions.
groff_diff(7):
In AT&T troff the pm request reports macro, string, and diversion
sizes in units of 128‐byte blocks, and an argument reduces the
report to a sum of the above in the same units. GNU troff
ignores any arguments and reports the sizes in bytes.
That's fine, but what if we want to look _inside_ a macro, string, or
diversion?
I propose to implement this:
.pm name Report the contents of macro, string, or diversion
name to the standard error stream. If name is
undefined, an error is produced (to distinguish this
case from an empty object). Newlines and ordinary
characters are written as-is on lines indented one
space. Special characters are represented in \[xx]
notation regardless of the selected escape character
or input syntax. Tabs, leaders, unprintable control
characters, and nodes are described on lines with no
indentation.
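
To make the proposed report format concrete, here is a minimal Python
sketch of the classification step. It is illustrative only and is not
groff code: the function name `dump_object` is hypothetical, the input
is assumed to be the stored contents as a plain string, and the sketch
simplifies by not parsing escape sequences (so it does no `\[xx]`
normalization), not modeling nodes, and dropping empty lines.

```python
def dump_object(contents: str) -> list[str]:
    """Return report lines for one string/macro/diversion body.

    Hypothetical helper: ordinary characters go on lines indented one
    space; tabs, leaders, and other control characters are described on
    lines with no indentation.
    """
    lines: list[str] = []
    run: list[str] = []  # ordinary characters collected so far

    def flush() -> None:
        # Emit the pending ordinary characters, indented one space.
        if run:
            lines.append(" " + "".join(run))
            run.clear()

    for ch in contents:
        if ch == "\n":
            flush()  # a newline ends the current indented line
        elif ch == "\t":
            flush()
            lines.append("tab")
        elif ch == "\x01":  # ^A is the roff leader character
            flush()
            lines.append("leader")
        elif ord(ch) < 0x20 or ord(ch) == 0x7F:
            flush()
            # Caret notation for the remaining control characters.
            lines.append("^" + chr(ord(ch) ^ 0x40))
        else:
            run.append(ch)
    flush()
    return lines


if __name__ == "__main__":
    for line in dump_object("1.1\tIntroduction\x01iii"):
        print(line)
```

Run on the `toc*entry` contents used in the example session, the sketch
prints ` 1.1`, `tab`, ` Introduction`, `leader`, and ` iii`, one per
line.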
I suggest that this won't break existing code because:
A. GNU troff has ignored arguments to `pm` for ~33 years; and
B. The format of debugging output (`troff -a`, `pm`, `pnr`, `pev`,
`ptr`) is not, and likely should not be, rigidly specified.
Example of an interactive session using the feature (purely notional,
typed into my editor window):
$ groff
.ds foo hello \(aq apostrophe\" string contents are read in copy mode
.pm foo
hello \(aq apostrophe
.de bar
. ft B
. nop Hello, world!
. ft
..
.pm bar
.de bar
. ft B
. nop Hello, world!
. ft
..
.ds toc*entry 1.1^IIntroduction^Aiii
.pm toc*entry
1.1
tab
Introduction
leader
iii
.de OB\"noxious old fart who knows tricks
. if ^B\\$1^Bfatal^B .ab \" get out in a panic
. ex \" exit more calmly
..
.pm OB
.de OB
. if
^B
\\$1
^B
fatal
^B
.ab
. ex
..
A problem with the above format is that trailing spaces before newlines
would not be obvious. I'm thinking that won't be too hard to address;
the dumper can count spaces until it encounters something that isn't
a space, newline, or the end of the object. We could then have
something like this.
.pm OB
.de OB
. if
space
newline
^B
\\$1
^B
fatal
^B
.ab
space
newline
. ex
space
newline
..
It would be more consistent, and possibly better, to just mark all
newlines thus.
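
The space-counting idea can be sketched as follows. This is a minimal
Python illustration under stated assumptions, not groff code: the
function name `dump_with_trailing_spaces` is hypothetical, one `space`
line is emitted per trailing space, every newline is marked (the more
consistent variant suggested above), and control characters and nodes
are left out to keep the sketch short.

```python
def dump_with_trailing_spaces(contents: str) -> list[str]:
    """Report ordinary text, making trailing spaces and newlines visible."""
    lines: list[str] = []
    run: list[str] = []  # ordinary characters on the current line
    pending = 0          # spaces seen since the last non-space character

    def flush_run() -> None:
        if run:
            lines.append(" " + "".join(run))
            run.clear()

    for ch in contents:
        if ch == " ":
            # Defer the decision: these may turn out to be trailing.
            pending += 1
        elif ch == "\n":
            flush_run()
            # Spaces directly before a newline are trailing; name them.
            lines.extend(["space"] * pending)
            pending = 0
            lines.append("newline")
        else:
            # The spaces were interior (or leading); keep them as-is.
            run.extend(" " * pending)
            pending = 0
            run.append(ch)

    # The end of the object also terminates a run of trailing spaces.
    flush_run()
    lines.extend(["space"] * pending)
    return lines
```

Buffering the spaces in a counter means the dumper never has to look
ahead in the input; it decides what the spaces were only when the next
non-space character, newline, or end of object arrives.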
I admit I don't really know yet what I'll be dealing with when it comes
to dumping nodes (which will be all over the place in diversions).
But, then, that aspect of groff seems to have mystified many over the
years.[2] I very much hope that being able to "debug print" them will
start to clear away the smoke and confusion. I want to do more than
just say that a node has been encountered. I want something like this.
.di mydiv
ca-fe
.ft B
heavy
.di
.pm mydiv
node {type=glyph, id='c', font-position=1}
node {type=glyph, id='a', font-position=1}
node {type=glyph, id='\hy', font-position=1}
node {type=glyph, id='f', font-position=1}
node {type=glyph, id='e', font-position=1}
newline
node {type=glyph, id='h', font-position=3}
node {type=glyph, id='e', font-position=3}
node {type=glyph, id='a', font-position=3}
node {type=glyph, id='v', font-position=3}
node {type=glyph, id='y', font-position=3}
True node data will, I'm sure, be much more complex and verbose. Likely
my first cut would be lamer.
.pm mydiv
node
node
node
node
node
newline
node
node
node
node
node
But I would want to swiftly improve that to report at least some basic
type information about the node, once I know what that looks like.
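
As a purely illustrative sketch of the richer report, the following
Python mirrors the notional `mydiv` output above. Everything in it is
an assumption: the `GlyphNode` class, its two fields, and the
`dump_diversion` function are hypothetical and do not reflect GNU
troff's actual node classes, which are expected to carry much more
data.

```python
from dataclasses import dataclass


@dataclass
class GlyphNode:
    """Hypothetical stand-in for a glyph node in a diversion."""
    glyph_id: str
    font_position: int

    def describe(self) -> str:
        # Matches the notional format shown in the mydiv example.
        return (
            f"node {{type=glyph, id='{self.glyph_id}', "
            f"font-position={self.font_position}}}"
        )


def dump_diversion(items: list) -> list[str]:
    """Describe each item; the string "\\n" stands for a newline."""
    report = []
    for item in items:
        if isinstance(item, GlyphNode):
            report.append(item.describe())
        else:
            report.append("newline")
    return report


if __name__ == "__main__":
    div = [GlyphNode("c", 1), GlyphNode("a", 1), "\n", GlyphNode("h", 3)]
    print("\n".join(dump_diversion(div)))
```

Giving each node type its own `describe` method would let a first cut
print a bare `node` and later versions add fields without changing the
dumper loop.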
Any objections?
Regards,
Branden
[1] Nobody _except_ mandoc(1) seems to handle this well. Credit where
it's due. https://savannah.gnu.org/bugs/?64229
[2] https://lists.gnu.org/archive/html/groff/2020-10/msg00105.html