character translation, hyphenation, and adjustment (was: Do Latin-2-based hyphenation files work with Unicode?)

groff

From:	G. Branden Robinson
Subject:	character translation, hyphenation, and adjustment (was: Do Latin-2-based hyphenation files work with Unicode?)
Date:	Wed, 13 Nov 2024 14:48:54 -0600

Hi onf,

At 2024-11-13T20:42:57+0100, onf wrote:
> On Wed Nov 13, 2024 at 7:25 PM CET, G. Branden Robinson wrote:
> > [...]
> > > i.e. translation should happen on output, not on input,
> >
> > I'm not sure I agree with that, given the above.  When I see `tr` used,
> > it is typically to make input more convenient.
> 
> I never said it's not used like that. I just meant to say that groff(7)
> suggests the translation happens at the moment the character is
> formatted for output rather than at the moment it is read in:
>   .tr abcd...
>       Translate ordinary or special characters a to b, c to d, and
>       so on PRIOR TO OUTPUT. [emphasis added]
> 
> which is why I wondered about the things you quote below.

I guess I interpret those words more generally than you do.  To me,
"prior to output" can mean _any time_ prior to output (once the
formatter has started running), and you seem to be inferring some later
stage of processing.

What exactly the stages are is hinted at by the subsection "Using
Symbols" of our Texinfo manual, but I have not attacked the claims that
it makes with my usual battery of experiments and, worse, it doesn't
at all discuss how the `tr`, `trin`, and `trnt` requests interact with
the multi-stage lookup process shown.

That part of the manual needs work.  I'll get to it some day.
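
As a small illustration of why the timing question matters, here is a
sketch adapted from the example the groff Texinfo manual gives when
describing `trnt`.  The diversion name `x` is arbitrary.  According to
that description, the `tm` request reports "b" on the standard error
stream when `tr` is used and "a" when `trnt` is used instead; I have not
checked that claim against every groff version.

```roff
.\" Translate "a" to "b".  Swap in ".trnt ab" to compare.
.tr ab
.\" Store a request transparently in diversion x.
.di x
\!.tm a
.di
.\" Reread the diversion; the stored tm request runs now.
.x
```

Running this with "groff -z" suppresses formatted output, so only the
diagnostic message appears.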

> > While I don't have an ETA for that, I don't want to complicate the
> > formatter itself with any features to make eight-bit encodings more
> > convenient to use.  That feels like throwing good money after bad.
> > UTF-8 is the future.  Heck, it's the present, most places.
> 
> I think if anything, this thread demonstrates the complexity that
> arises from using multiple character encodings.

Agreed.

> I was just trying
> to make it work that way because that's what we have now, but it
> would obviously be much better if one could use UTF-8 directly in
> the hyphenation files (or at least the \[u...] characters) without
> having to jump through all these hoops.

We get our hyphenation pattern files from various TeX-related projects.
I observe that these either use a "native" 8-bit encoding or
(increasingly as years pass) UTF-8.  Implementing support for spelling
non-ASCII characters with \[uXXXX] escape sequences seems like a detour
to me.  If we solve the formatter's UTF-8 reading disability, the
problems with the character encoding of hyphenation pattern files should
pretty much disappear.

> > [...]
> > > groff(7) does mention it, but it's among the last things mentioned
> > > in the Hyphenation section. The texinfo manual doesn't mention it
> > > at all in its section 5.1.3 about Hyphenation where I would expect
> > > it.  (At least the online version -- I haven't found any git
> > > source for it, just tarballs.)
> >
> > You can review up-to-date documentation here:
> >
> > https://www.dropbox.com/sh/17ftu3z31couf07/AAC_9kq0ZA-Ra2ZhmZFWlLuva?dl=0
> >
> > The Git source for the bleeding edge of our documentation is at:
> >
> > https://git.savannah.gnu.org/cgit/groff.git/tree/doc/
> > https://git.savannah.gnu.org/cgit/groff.git/tree/man/
> 
> Thanks; I overlooked the texinfo source in the doc/ directory. I don't
> notice any changes to the hyphenation-related sections that would make
> it obvious one should load the appropriate localization files rather
> than do it 'by hand' (i.e. by using .hpf etc.), though.

Okay.  I'll review that material in the window between now and the groff
1.24 release to see if I can knock it into more helpful shape.

> (By the way, that Dropbox PDF viewer is borderline unusable

I agree.  To list all the frustrations I've had with it would double the
length of this message.

> and downloading the PDF requires logging in.

Oh, yuck.  No wonder I never get any feedback on it.

> If you ever need something less bloated, I recommend
> <https://paste.c-net.org>.)

Thanks!  I'll look.  I am not a _happy_ Dropbox user, merely one who
doesn't know a better place.

> > That's actually a bad example, but a very popular misconception.  You
> > probably mean "if .hy worked like .ps". Or .ft, .ev, .in, .ll,
> > .ls, .lt, .po, or .vs, or groff's .fam, .fcolor, .gcolor, or .pvs.
> >
> > Without an argument, neither .hy nor .ad restores the "previous"
> > hyphenation mode or adjustment setting, respectively.
> 
> That's not a bad example, you just misunderstood. I know .ad without
> argument doesn't restore previous adjustment mode; it caused me some
> headaches in the past.

It causes headaches for me to this day, when I have to talk man page
authors out of their incorrect uses of it!

> I eventually realized that .ad is not meant to switch back-and-forth
> between adjustment modes, but to restore adjustment after it was
> disabled with .na.

Right!  That's how it was born.  See my email from last month about
Sixth Edition troff and the telling shape of the `ad` request at that
time,[1] before Seventh Edition and CSTR #54 came along and a sort of
religious cult organized around the idea that troff had worked that way
all the way back to 1971.

("But the Labs didn't even _have_ a typesetter back then--"

"Yes, exactly!")

> What I was saying above is that if .hy worked in this way too, i.e. if
> .hy without arguments restored hyphenation after .nh was called, the
> macro I proposed wouldn't be necessary.

I agree.  Alas, the CSRC seems to have thought of hyphenation as a
one-time, set-and-forget configuration parameter.  There wasn't even a
way in the AT&T troff language for a document to inquire _what_ the
hyphenation mode was.
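
In GNU troff, by contrast, the read-only register `.hy` exposes the
current mode, so a document can save it and restore it explicitly
instead of relying on an argumentless `hy`.  The following is a minimal
sketch under that assumption; the register name `savedhy` is made up for
illustration, and the values reported depend on the groff version and
the localization files loaded.

```roff
.\" Remember the hyphenation mode currently in effect.
.nr savedhy \n[.hy]
.tm hyphenation mode before nh: \n[.hy]
.nh
.tm hyphenation mode after nh: \n[.hy]
This text is set without automatic hyphenation.
.br
.\" Restore the saved mode rather than calling hy with no argument.
.hy \n[savedhy]
.tm hyphenation mode after restore: \n[.hy]
```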

> > [...]
> > I think these are horrible warts in the *roff language that an
> > iconoclast should have smashed years ago.  But they work fine for the
> > most common cases (temporary disablement with `nh` and `na`,
> > respectively) [...]
> 
> I would disagree it works fine for temporary disablement with .nh;
> see above.

It does if you're using AT&T troff with its bespoke hyphenation system,
or you're a man page author who either hates automatic hyphenation or
doesn't pay very close attention to where hyphenation breaks occur.

The "screw this" attitude of many software developers toward man pages,
toward *roff, and toward documentation in general, has led to much
debilitating inertia.

groff should have done something about this as soon as it imported TeX's
hyphenation system; I don't know why it didn't.  Same problem I have, I
expect--not enough spoons.

> > >  but (unless I am mistaken again :) it doesn't and cannot due to
> > >  desired compatibility with AT&T troff.
> >
> > You might be interested in a feature in the forthcoming groff
> > 1.24.0:
> >
> > NEWS:
> > *  A new request, `hydefault`, and read-only register, `.hydefault`,
> >    manage the default automatic hyphenation mode of an environment.
> >    This resolves a long-standing problem of *roff formatting.
> >
> >      When processing input like this,
> >      .nh
> >      and we temporarily shut off automatic hyphenation,
> >      .hy
> >      the foregoing request would not do exactly what we expect.
> >
> >    AT&T and other troffs would set the hyphenation mode to 1 instead
> >    of the previous value; for GNU troff this was not an appropriate
> >    value for the English hyphenation patterns.  (For example,
> >    "alibi" would break as "ali-bi" instead of "al-ibi" after this
> >    argumentless `hy` invocation.)  With updates to groff's
> >    localization files, the foregoing input now works as desired.
> 
> Sounds like what .hy should have been doing from the beginning :)

Or it should have worked like .ev, .ft, .in, .ll, .ls, .lt, .po, .ps, or
.vs--yes.  And had an introspection register, damn it.

Speaking of introspection, I've added several new requests to the GNU
troff language for 1.24.  I barrelled ahead with these because they have
no effect on formatter state--they are there solely to help a person
troubleshooting groff, a macro package, or a document figure out what's
going on.  In my opinion, a person who is not developing groff itself
should never have to launch GDB to discover relevant state about how
their document is being formatted.  All too often, though, that is the
case.  Less often in 1.24, I hope.

groff(7):
     .pcolor    Report, to the standard error stream, each defined color
                name, its color space identifier, and channel value
                assignments.  A device’s default stroke and/or fill
                colors, “default”, are not listed since they are
                immutable and their details unknown to the formatter.
     .pcolor col ...
                Report, to the standard error stream, the name, color
                space identifier, and channel value assignments of each
                color col.
     .pcomposite
                Report, to the standard error stream, the list of
                defined composite characters.  The “from” code point is
                listed first, followed by its “to” mapping.
     .phcode c ...
                Report, to the standard error stream, the hyphenation
                code of each ordinary or special character c.
     .phw       Report, to the standard error stream, the list of
                hyphenation exceptions associated with the current
                hyphenation language.  Each hyphenation point is marked
                with “-”.  Words that will not be hyphenated at all are
                prefixed with “-”.  Those to which the automatic
                hyphenation mode applies (meaning those defined in a
                hyphenation pattern file rather than with the hw
                request) are suffixed with a tab and asterisk (*).
     .pline     Report, to the standard error stream, the list of output
                nodes corresponding to the pending output line.  The
                list is empty if there are none.
     .pnr       Report the names, contents, and (as applicable) assigned
                formats of all defined registers to the standard error
                stream.
     .pnr reg ...
                Report the name and value and, if the value is numeric,
                the assigned format of each register reg, to the
                standard error stream.

`pnr` isn't new, but previously has been available only in its
zero-argument form, and did not report the number format.

For groff 1.25 I hope to get `pline` fleshed out a bit more, and add a
`pchar` request that tells you how a given ordinary or special character
resolves to a glyph (user-defined character, text font glyph, special
font glyph, etc.).  The big kahuna is to "break" `pm`...

groff_diff(7):
     In AT&T troff the pm request reports macro, string, and diversion
     sizes in units of 128‐byte blocks, and an argument reduces the
     report to a sum of the above in the same units.  GNU troff ignores
     any arguments and reports the sizes in bytes.

...and accept a macro, string, or diversion identifier as an argument,
then dump its contents in an unambiguous form.  Getting the `pline` work
done will help a lot with that, since the biggest obstacle to doing this
right now--the biggest question mark--is how to represent the many types
of "node" object the formatter has.

> > I have plans to fix the argumentless `ad` request, but just today I
> > decided to kick that out past 1.24.
> >
> > https://savannah.gnu.org/bugs/?65954
> 
> I don't feel like this fixes anything, honestly.
> Before this, I could do:
>   .ad r
>   Lorem ipsum dolor sit amet...
>   .br
>   .na
>   Lorem ipsum.
>   .br
>   .ad
>   Lorem ipsum dolor sit amet...
> and couldn't do:
>   .ad r
>   Lorem ipsum dolor sit amet...
>   .br
>   .ad c
>   Lorem ipsum.
>   .br
>   .ad
>   Lorem ipsum dolor sit amet...
> 
> Now I will not be able to do either. I suggest this instead:
>   .ad
>       Set adjustment mode to \n[.J] if set, b otherwise.
>   .ad 0
>       Disable adjustment.
>       Update \n[.j] and \n[.J] (previous value of \n[.j]).
>   .ad MODE
>       Set adjustment mode to MODE (l,c,r,b,n).
>       Update \n[.j] and \n[.J].
>   .na
>       As .ad 0.
> 
> This should make both scenarios work as expected without breaking any
> other ways in which people currently use it. (At least I hope so.)

It's pretty important to me to detangle adjustment from alignment.
Continuing to heap complications on the existing `ad` request doesn't
seem like a promising path forward to me.

I'll have to think about it.  Let me quote one of my replies to Bjarni.

>>> What's weird are these behaviors:
>>>
>>> .ad c
>>> qrs tuv\p
>>> .na
>>> wxy zab\p
>>> .ad
>>> cde fgh\p
>>>
>>> ...which results in:
>>>
>>> (Once again I'll invite the reader to pause and test their command
>>> of *roff by predicting the output of the foregoing before
>>> proceeding.)
>>>
>>>  qrs tuv
>>> wxy zab
>>>  cde fgh
>>>
>>> So if "adjustment" is, as I claim, "the widening of the spaces
>>> between words until glyphs abut both the left and right margins",
>>> well, that's clearly not happening here.  But neither has
>>> "adjustment" "begun".
>>>
>>> Same with right-"adjustment":
>>>
>>> .ad r
>>> ijk lmn\p
>>> .na
>>> opq rst\p
>>> .ad
>>> uvw xyz\p
>>>
>>> ...which yields:
>>>
>>>   ijk lmn
>>> opq rst
>>>   uvw xyz
>>>
>>> One could infer from this that `ad` without arguments means "go back
>>> to what the previous 'adjustment' was", but as we both showed:
>>>
>>> .ad l
>>> mno pqr\p
>>> .na
>>> stu vwx\p
>>> .ad
>>> yza bcd\p
>>>
>>> ...which results in...
>>>
>>> mno pqr
>>> stu vwx
>>> yza   bcd
>>>
>>> ...defeats that claim handily.

"Invoking `ad` without argument means 'Go back to whatever the previous
adjustment mode was, unless it was left-alignment/no alignment, in which
case adjust to both margins.'"

That was bonkers.  I want to know who talked Joe Ossanna into that.
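
Until that changes, one way to sidestep the argumentless form in GNU
troff is to save the read-only register `.j` and hand its value back to
`ad` later.  This is only a sketch: the register name `savedadj` is made
up for illustration, and it relies on `ad` accepting a numeric argument
previously read from `.j`, as the groff documentation describes.

```roff
.ad r
Lorem ipsum dolor sit amet...
.br
.\" Remember the current adjustment mode before changing it.
.nr savedadj \n[.j]
.ad c
Lorem ipsum.
.br
.\" Restore the saved mode explicitly instead of a bare "ad".
.ad \n[savedadj]
Lorem ipsum dolor sit amet...
```

The `br` requests matter: without them, the pending output line would be
set using whichever mode is in effect when the line finally breaks.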

Regards,
Branden

[1] https://lists.gnu.org/archive/html/groff/2024-10/msg00107.html
