From: G. Branden Robinson
Subject: Re: character translation, hyphenation, and adjustment (was: Do Latin-2-based hyphenation files work with Unicode?)
Date: Wed, 13 Nov 2024 19:08:22 -0600

Hi onf,
At 2024-11-13T23:45:59+0100, onf wrote:
> On Wed Nov 13, 2024 at 9:48 PM CET, G. Branden Robinson wrote:
> > I guess I interpret those words more generally than you do. To me,
> > "prior to output" can mean _any time_ prior to output (once the
> > formatter has started running), and you seem to be inferring some
> > later stage of processing.
>
> To me, output seems to imply putting words on a page. I haven't given
> the word's meaning in the context of troff much thought before,
> though.
There's a whole sausage factory involved. ;-) I had to cover myself at
one point when I wrote the introductory section of chapter 5 of our
Texinfo manual.
---snip---
(2) This statement oversimplifies; there are escape sequences whose
purpose is precisely to produce glyphs on the output device, and input
characters that _aren't_ part of escape sequences can undergo a great
deal of processing before getting to the output.
---end snip---
> I don't disagree. I just wasn't sure how complex adding full UTF-8
> support is.
I'm not sure either. But it's got to be done.
> It's not so long ago I saw some mentions of support for
> the \[u_...] characters being added to some driver,
You might be thinking of this:
commit a6289c1508acf31dce73da2ffa9e7de102986298
Author: G. Branden Robinson <g.branden.robinson@gmail.com>
Date:   Wed Aug 21 08:40:27 2024 -0500

    font/devps/ZD: Regen from updated dingbats.map.

    * font/devps/ZD: Regenerate using updated dingbats.map.

    Fixes <https://savannah.gnu.org/bugs/?63018>.  Thanks to Deri James and
    Dave Kemper for (extensive) consultation.

...of which part of the commit's diff looks like:
+u27BA 831,579 3 250 a187
+u27BB 873,578 3 251 a188
+u27BC 927,542 3 252 a189
+u27BD 970,616 3 253 a190
+u27BE 918,593 3 254 a191
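
For readers unfamiliar with font description files, here is a minimal Python sketch of how one of those added lines breaks down. It assumes the field order described in groff_font(5): glyph name, comma-separated metrics (width first), glyph type, code, and an optional device-specific name. The names `GlyphEntry` and `parse_charset_line` are hypothetical, invented for this illustration, and the sketch assumes decimal codes as in the lines above.

```python
from typing import List, NamedTuple, Optional


class GlyphEntry(NamedTuple):
    name: str               # groff glyph name, e.g. "u27BA"
    metrics: List[int]      # width first, then optional height, depth, ...
    glyph_type: int         # 1 = descender, 2 = ascender, 3 = both
    code: int               # index the output driver uses to select the glyph
    entity: Optional[str]   # device-specific name, e.g. PostScript "a187"


def parse_charset_line(line: str) -> GlyphEntry:
    """Split one charset line (optionally prefixed by a diff '+') into fields."""
    # Drop the leading "+" that a diff puts in front of added lines.
    fields = line.lstrip("+").split()
    name, metrics, glyph_type, code = fields[:4]
    entity = fields[4] if len(fields) > 4 else None
    return GlyphEntry(
        name,
        [int(m) for m in metrics.split(",")],
        int(glyph_type),
        int(code),  # assumption: decimal code, as in the diff above
        entity,
    )


if __name__ == "__main__":
    print(parse_charset_line("+u27BA 831,579 3 250 a187"))
```

Read this way, the first line maps the glyph named `u27BA` to code 250 in the font, with `a187` as its PostScript glyph name.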
> so I figured it might for some reason be much easier than proper UTF-8
> support.
That's a different part of the problem. We can express any Unicode code
point in GNU troff _output_. The reason people say "groff doesn't
support UTF-8" is that GNU troff, the formatter program specifically,
does not correctly interpret UTF-8-encoded input files.
The other reason people say it is that it is a simple claim to make, and
they don't particularly care if it is false or misleading. They
generally have some other piece of software they're advocating instead.
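
To make the input/output distinction concrete, here is a rough Python sketch of the kind of transformation a preprocessor such as preconv(1) performs before the formatter sees the text: decode UTF-8 and rewrite every non-ASCII code point as a `\[uXXXX]` special character escape sequence. The function name `utf8_to_groff` is hypothetical, and the sketch deliberately ignores everything else a real preprocessor has to handle (encoding detection, other input encodings, and so on).

```python
def utf8_to_groff(data: bytes) -> str:
    """Rewrite UTF-8 input so that only ASCII reaches the formatter."""
    out = []
    for ch in data.decode("utf-8"):
        cp = ord(ch)
        if cp < 0x80:
            # ASCII passes through unchanged.
            out.append(ch)
        else:
            # At least four uppercase hex digits, as in \[u27BA].
            out.append(f"\\[u{cp:04X}]")
    return "".join(out)


if __name__ == "__main__":
    print(utf8_to_groff("na\u00efve \u27ba".encode("utf-8")))
```

The formatter then deals only with ASCII input plus escape sequences that name code points, which is why the output side can represent any code point even though raw UTF-8 input is not interpreted directly.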
> They should have called .ad without arguments .ra (restore adjustment,
> like .rs) and we could have avoided all that confusion.
That kind of gets the chronology backwards.
Saltzer's RUNOFF (1966) had `ad` and `na`. `ns` and `rs` arrived with
Ossanna's "new roff" (nroff, 1973) in Third Edition Unix.
I don't know what devils whispered in Ossanna's ear to overload the
meaning of `ad`, but I've got some invective stored up for them if
they're still alive. For once I don't presume it's Thompson's fault;
the two-letter namespace wasn't anywhere near saturated yet.
> Perhaps, but you said it works fine for "temporary disablement with
> `nh`". Disabling hyphenation once and for all does not classify as
> temporary disablement, imho.
You're kind of confusing me here. Whether changing the line length with
`ll` is "temporary" or not depends on whether you issue a subsequent
request to do so. In _this_ respect, disabling hyphenation is no more
or less permanent than most other operations in troff.
There are _some_ irreversible ones, like adding to the hyphenation
exception list with `hw`.
> And it working poorly and the user just not noticing doesn't mean it
> works fine, either.
I was being a little facetious, and giving you a sample of the kind of
evasions I run into when people don't want to fix their broken man
pages.
> > [...]
> > > Sounds like what .hy should have been doing from the beginning :)
> >
> > Or it should have worked like .ev, .ft, .in, .ll, .ls, .lt, .po,
> > .ps, or .vs--yes. And had an introspection register, damn it.
>
> That sounds too good, I would be happy if it at least worked
> like .ad does... (:
I wasn't bold enough to change it to that degree. But the groff 1.24
implementation maintains AT&T compatibility if you don't load a
localization file, and AT&T troff had no localization files to load.
> In my case, if I really can't figure out a way to make something work,
> I give up, figuring that a less complex approach will be easier to
> work with and debug in the future anyway.
It's also okay to ask others. That's one of the reasons this mailing
list is here. Also, occasionally something is hard because troff's
design isn't everything it could be.
> The mixing of alignment and adjustment functionality in .ad puzzles me
> to this day, especially combined with the existence of .ce and .rj.
.rj is a GNU extension. (Well, maybe SoftQuad troff had it--I don't
know and have not _ever_ been able to find an sqtroff manual, a
supposedly award-winning treatise, be it scanned or in dead tree form.)
But yes, with that caveat, I strongly agree, as
<https://savannah.gnu.org/bugs/?65954> suggests.
> My proposal was based on the assumption that maintaining compatibility
> with other troffs is desired.
I'm concerned mainly with compatibility with AT&T troff. Heirloom
Doctools troff and neatroff both came along much later, and I'm not aware
that a large corpus of documents has ever been written specifically for
them.
Further, slavish fidelity to AT&T troff behavior has never been a
feature of GNU troff. The compatibility mode flag `-C` attempts to get
you a few nines of the way there, but there are areas where CSTR #54 is
vague (also, it has errors in it), and where the operation of the
formatter is hard to fathom.
25 years ago, Trent Fisher (or maybe James Clark or Werner Lemberg) made
the following observation:
---snip---
.cf filename
When used in a diversion, this will embed in the diversion an object
which, when reread, will cause the contents of filename to be
transparently copied through to the output. In @sc{Unix} troff, the
contents of filename is immediately copied through to the output
regardless of whether there is a current diversion; this behaviour is so
anomalous that it must be considered a bug.
---end snip---
All that said, when I make a behavior change to groff, I strive to
document the rationale convincingly. I've discarded plenty of my own
ideas (you can search for Savannah tickets filed by me, assigned to me,
and closed with status "Rejected"), and folks like Dave and Deri have
talked me out of others. :)
Regards,
Branden