Re: minor hyphenation issue


From: Werner LEMBERG
Subject: Re: minor hyphenation issue
Date: Wed, 12 Apr 2017 08:08:05 +0200 (CEST)

>> the basic ("knuthian") tex hyphenation algorithm does not handle
>> any words with diacritics, and that is what the us list is based
>> on.

In general, this is not a restriction since up to 256 characters are
allowed in `patgen', which is the ultimate program to generate
hyphenation patterns.  Non-English hyphenation patterns simply use
precomposed characters with diacritics; for example, the German
patterns now use the latin-9 character set.  The English patterns
could do exactly the same to allow stuff like `chef d'œuvre' (assuming
that this word could be hyphenated, which is probably not true :-).
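
To make the single-byte constraint concrete, here is a minimal Python sketch that checks whether a word fits into a given 8-bit encoding. The helper name `fits_encoding` is made up for this illustration; the codec names are Python's standard ones (`iso8859_15` is latin-9). Nothing here is part of `patgen' or groff.

```python
def fits_encoding(word: str, encoding: str) -> bool:
    """Return True if every character of `word` exists in `encoding`."""
    try:
        word.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    word = "chef d'œuvre"
    # ascii and latin-1 lack the oe ligature; latin-9 (iso8859_15) has it.
    for codec in ("ascii", "latin_1", "iso8859_15"):
        print(codec, fits_encoding(word, codec))
```

Under these assumptions the word is representable in latin-9 but in neither ASCII nor latin-1, which is the kind of gap a precomposed single-byte pattern set has to cover.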

The real issue is rather that *users* are not accustomed to selecting
an input and/or font encoding while typesetting US English texts.  The
only chance to improve that IMHO is to use TeX systems that natively
use UTF-8.  So groff has a slight advantage here over plain TeX since
it is set up by default to use latin-1.

Note, however, that no one takes care of the US patterns.  The most
recent version used in the `tex-hyphen' project at

  https://github.com/hyphenation/tex-hyphen

is from 1990!  In other words, the only `standardized' corrective is
Barbara's list...

> I see.  Werner (or anyone else familiar with the groff side of
> things), is this limitation also present in groff?  Or could groff's
> version of tmac/hyphenex.us be put into Latin-9 encoding to
> accommodate these words?

It could.  However, for the sake of maintainability, I strongly
suggest that `hyphenex.us' stays in sync with the original one edited
by Barbara.  You can always add new entries with the `.hw' request
(provided your setup correctly understands the corresponding encoding;
have a look at how German is handled, for example).

>> i'm surprised that the encoding is (still?) listed as latin-* --
>> there has been an effort to support utf8, so i (perhaps rashly)
>> assumed that would be the base encoding.

groff cannot digest UTF-8 natively.  However, there are means to
automatically map UTF-8 to its internal representation, which usually
is latin-1, together with constructs like \[uXXXX] to access Unicode
encoded characters outside the selected encoding.
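
The mapping idea can be sketched in a few lines of Python. This only illustrates the principle described above, under the assumption that latin-1 is the selected encoding; the function name `to_groff_input` is hypothetical, and groff's own conversion tooling may well behave differently in detail.

```python
def to_groff_input(text: str) -> str:
    """Keep latin-1 code points as they are; write everything else
    as a \\[uXXXX] escape with at least four uppercase hex digits."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x100:
            # Representable in latin-1: pass through unchanged.
            out.append(ch)
        else:
            # Outside latin-1: fall back to a Unicode escape.
            out.append(f"\\[u{cp:04X}]")
    return "".join(out)


if __name__ == "__main__":
    print(to_groff_input("Käse, chef d'œuvre"))
```

Here `ä` (U+00E4) stays as it is because it exists in latin-1, while `œ` (U+0153) does not and is written as an escape.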

> http://git.savannah.gnu.org/gitweb/?p=groff.git;a=history;f=tmac/hyphenex.det;h=c74eebabff8e35353fdfb176a5c98df56c3e4ea0;hb=HEAD

`hyphenex.det' is no longer maintained – and now deleted from the
repository: I took the opportunity to completely update the German
hyphenation patterns, and this file is no longer needed.

> Their encodings on the TeX side may have been updated, and the
> changes never pulled to groff.

Today, almost all hyphenation patterns in the `tex-hyphen' repository
(and thus in the distribution from CTAN) are in UTF-8 encoding.

> In contrast (and probably because of this thread), groff's
> tmac/hyphenex.us was updated from TeX four days ago:

Exactly.

> This file does not specify any encoding, but its entire contents
> fall into 7-bit ASCII.

Well, the list simply doesn't contain any non-ASCII words...


    Werner
