an observation and proposal about hyphenation codes

From:	G. Branden Robinson
Subject:	an observation and proposal about hyphenation codes
Date:	Wed, 31 Jul 2024 21:23:13 -0500

Hi folks,

Dave and I have been discussing hyphenation codes extensively over the
past few days; see recent bug-groff list traffic.

There is much I am coming to understand about GNU troff's hyphenation
system, and I've discovered a salient fact that no one has complained
about (as far as I know), but which seems material to the problem and
should motivate either a change to data organization inside the
formatter, or reorganization of some of the stock macro files we ship.

First, what's a hyphenation code?

As of very recently, our Texinfo manual explains them as follows.

---snip---
   For automatic hyphenation to work, the formatter must know which
letters are equivalent; for example, the letter 'E' behaves like 'e';
only the latter typically appears in hyphenation pattern files.  GNU
'troff' expects characters that participate in automatic hyphenation to
be assigned "hyphenation codes" that define these equivalence classes.
At startup, GNU 'troff' assigns hyphenation codes to the letters
'a'-'z', applies the same codes to 'A'-'Z' in one-to-one correspondence,
and assigns a code of zero to all other characters.

   The 'hcode' request extends this principle to letters outside the
Unicode basic Latin alphabet; without it, words containing such letters
won't be hyphenated properly even if the corresponding hyphenation
patterns contain them.
---end snip---

A fact I found noteworthy about how GNU troff actually sets up
hyphenation codes is that the equivalence classes it is designed to
support _are almost never used_ beyond lettercase coalescence.[1]

Instead, in our localization files, every non-basic-Latin character gets
its own bucket, occupied by its upper- and lowercase forms, if both exist.

Here are all the hyphenation codes we set up in all our localization
files.

cs.tmac (Czech):
.hcode á á  Á á
.hcode č č  Č č
.hcode ď ď  Ď ď
.hcode é é  É é
.hcode ě ě  Ě ě
.hcode í í  Í í
.hcode ň ň  Ň ň
.hcode ó ó  Ó ó
.hcode ř ř  Ř ř
.hcode š š  Š š
.hcode ť ť  Ť ť
.hcode ú ú  Ú ú
.hcode ů ů  Ů ů
.hcode ý ý  Ý ý
.hcode ž ž  Ž ž

de.tmac (German):
.hcode ä ä  â â  à à  á á  ã ã  å å  æ æ
.hcode ç ç
.hcode é é  è è  ë ë  ê ê
.hcode í í  ì ì  î î  ï ï
.hcode ñ ñ
.hcode ó ó  ò ò  ô ô  ö ö  ø ø
.hcode ú ú  ü ü  û û
.
.hcode Ä ä  Â â  À à  Á á  Ã ã  Å å  Æ æ
.hcode Ç ç
.hcode É é  È è  Ë ë  Ê ê
.hcode Í í  Ì ì  Î î  Ï ï
.hcode Ñ ñ
.hcode Ó ó  Ò ò  Ô ô  Ö ö  Ø ø
.hcode Ú ú  Ü ü  Û û
.
.hcode ß ß

en.tmac (English): NONE

es.tmac (Spanish):
.hcode á á  Á á
.hcode é é  É é
.hcode í í  Í í
.hcode ó ó  Ó ó
.hcode ú ú  Ú ú
.hcode ñ ñ  Ñ ñ
.hcode ü ü  Ü ü

fr.tmac (French):
.hcode à à  À à
.hcode â â  Â â
.hcode ç ç  Ç ç
.hcode è è  È è
.hcode é é  É é
.hcode ê ê  Ê ê
.hcode ë ë  Ë ë
.hcode î î  Î î
.hcode ï ï  Ï ï
.hcode ô ô  Ô ô
.hcode ù ù  Ù ù
.hcode û û  Û û
.hcode ü ü  Ü ü
.hcode ÿ ÿ  Ÿ ÿ
.hcode œ œ  Œ œ

it.tmac (Italian): NONE

ru.tmac (Russian):
.hcode а а  А а
.hcode б б  Б б
.hcode в в  В в
.hcode г г  Г г
.hcode д д  Д д
.hcode е е  Е е
.hcode ё ё  Ё ё
.hcode ж ж  Ж ж
.hcode з з  З з
.hcode и и  И и
.hcode й й  Й й
.hcode к к  К к
.hcode л л  Л л
.hcode м м  М м
.hcode н н  Н н
.hcode о о  О о
.hcode п п  П п
.hcode р р  Р р
.hcode с с  С с
.hcode т т  Т т
.hcode у у  У у
.hcode ф ф  Ф ф
.hcode х х  Х х
.hcode ц ц  Ц ц
.hcode ч ч  Ч ч
.hcode ш ш  Ш ш
.hcode щ щ  Щ щ
.hcode ъ ъ  Ъ ъ
.hcode ы ы  Ы ы
.hcode ь ь  Ь ь
.hcode э э  Э э
.hcode ю ю  Ю ю
.hcode я я  Я я

sv.tmac (Swedish):
.hcode å å  Å å
.hcode ä ä  Ä ä
.hcode ö ö  Ö ö
.hcode é é  É é

You will observe that most languages declare hyphenation codes only for
standard letters in their alphabets.  For example, Czech omits the
Polish letter ł, even though that letter is present in the ISO 8859-2
encoding that the localization file requires.

Except German.  German goes ahead and eats every letter in the Latin-1
supplement even though many of them are unknown in pure German
orthography.  (Any language can employ loan words, of course.)

And that suggests the direction of an implementation decision that I am
questioning.

In GNU troff, hyphenation codes are _global_.  They do not depend on the
hyphenation _language_ selected with the `hla` request, which is a
property of a GNU troff _environment_.

This means that if, in one language, "á" should be treated as "a" for
hyphenation purposes, but as its own letter in another language, GNU
troff will not be able to hyphenate for both languages correctly in the
same "run", even if distinct environments are used carefully to typeset
each of the languages.

Here's how you'd make these distinct declarations.

.hcode á a \" á participates in hyphenation just like a
.hcode á á \" á behaves as a distinct letter in hyphenation

This does suggest a workaround, if a tedious one: the document author
must write some macro that reconfigures the hyphenation codes as needed
when switching environments.
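
A minimal sketch of that workaround follows.  The macro names HCODES-A
and HCODES-B and the environment names "lang-a" and "lang-b" are
hypothetical, and I assume each environment has already had its `hla`
language and hyphenation patterns set up.  It is meant only to show the
shape of the chore, not a recommended interface.

```roff
.\" Sketch only.  HCODES-A and HCODES-B are hypothetical macro names,
.\" and "lang-a"/"lang-b" are hypothetical environment names.
.de HCODES-A
.  hcode á a \" language A: á participates in hyphenation just like a
..
.de HCODES-B
.  hcode á á \" language B: á behaves as a distinct letter
..
.
.ev lang-a
.HCODES-A
Text in language A.
.br \" flush the pending line before leaving the environment
.ev
.
.ev lang-b
.HCODES-B
Text in language B.
.br
.ev
```

Because the codes are global, each macro has to restate every code on
which the two languages disagree, and the author has to remember to call
the right one after every environment switch.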

That seems like a problem to me--or it would, except that no one has
complained about our hyphenation being bad for non-English languages.
(We do sometimes hear observations that hyphenation isn't performed _at
all_ for a language--like Hungarian.)

However, in the meantime, meaning for groff 1.24, I propose to move
`hcode` definitions to where they make more sense: the character set
macro files "koi8-r.tmac", "latin1.tmac", "latin2.tmac", and
"latin9.tmac".  (If/when I do that, I'll need to update the
"tmac/LOCALIZATION" file accordingly.)

Another problem you may observe is that hyphenation codes are an
additional barrier to mixing character sets; there will be incongruities
due to code collisions.  As I put it to (a heavily paraphrased) Dave in
a recent Savannah ticket:[2]

>> [[
>> Why can't we give a special character a bespoke hyphenation code,
>> like this?
>>
>> .hcode \['a] \['a]
>>
>> ...when we can do this just fine?
>>
>> .hcode á á
>> ]]
>
> Because the formatter doesn't know what [hyphenation code] value to
> give [the special character].  Under the hood, [a hyphenation code] is
> just a character code--in other words, on an ISO 8859 system, the
> hyphenation codes for 'a' through 'z' are 97 through 122--but our
> documentation stands on its head to avoid saying that.  The trouble is
> that there is a potentially larger space of sui generis special
> characters, by which I mean ones that don't belong to an equivalence
> class of a Basic Latin letter.  [... The] German Eszett [for example]
> is not.  If we had an Icelandic locale, thorn and eth would similarly
> have to have hyphenation codes above 127 decimal.
>
> The real fun comes when you add letters from multiple ISO 8859
> character sets [and KOI8-R, as I didn't mention].  Before long you're
> going to have collisions.
>
> So it's good that our documentation does the headstand.  We should not
> disclose what the hyphenation code values are; we need only ensure
> that they sort into the correct equivalence classes, so that they then
> interoperate as desired with the hyphenation patterns.
>
> When we get support for UTF-8-encoded hyphenation pattern files,
> things will become straightforward again.
>
> In the meantime, what I think I will do is use a `static int` to mint
> a sequence number (starting at 256) for hyphenation codes any time a
> special character needs one sui generis.

For groff 1.25, if we get fully armed and operational UTF-8 input
reading into GNU troff, then part of this problem will go away.

If we continue to use character codes as hyphenation code values
(clandestinely), then the collision problem will disappear.

The equivalence class problem will not.

But what I don't know is how much of a problem the latter is in
practice.  It depends on how TeX handles the problem, since we use its
hyphenation patterns.  I can physically read the files but am not smart
enough to infer equivalence between accented and non-accented vowels
within them, for example.

Does anyone know?

Regards,
Branden

[1] "Almost never".  So what's an exception?

tmac/ps.tmac:

.fchar \[S ,] \o'S\[ac]'
.hcode \[S ,]s
.fchar \[s ,] \o's\[ac]'
.hcode \[s ,]s

A similar thing is done for some other accented letters that aren't in
ISO 8859-1.

Here, a user-defined special character is getting a hyphenation code,
but not one all its own.  It is sharing one with the letter "s".  In GNU
troff, you cannot assign a special character its own hyphenation code
equivalence class, though I am looking at relaxing that restriction, for
orthogonality and because it looks easy.

[2] ...but actually tugging on a loose thread identified 10 years ago by
    Carsten Kunze.

    https://savannah.gnu.org/bugs/?42870