


From: Oliver Corff
Subject: Re: neatroff for Russian. (Was: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian)
Date: Mon, 1 May 2023 22:19:32 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.10.1

Hi Branden,

On 30/04/2023 15:35, G. Branden Robinson wrote:
> At 2023-04-29T21:38:53-0500, Dave Kemper wrote:
>> On 4/29/23, Oliver Corff <oliver.corff@email.de> wrote:
>>> Would it be a feasible option to use UTF-8 throughout the inner
>>> workings of a future groff,
>
> I'm going to phrase this more confrontationally than it needs to be just
> to make a point about software design:

No need for apologies. We are discussing principles of work here.

> It's none of your business what data type groff uses for characters in
> its _inner workings_.
>
> Of course I mean that purely from the software-architectural
> perspective.  There is no reason for anyone except groff's developers to
> care what primitive data type groff uses for this purpose as long as it
> behaves correctly and is performant.  The whole point of encapsulation
> is to keep other software modules from having to worry about this sort
> of thing.

If it hisses like a utf8-duck, quacks like a utf8-duck and croaks like a
utf8-duck, it is a utf8-duck.

> In another sense, it's totally your business and you can look at the
> implementation at any time--it's Free Software.  But other software,
> including parts of groff that are not GNU troff, the formatter, should
> keep its dirty nose out, and expect to be excluded through
> language-imposed visibility restrictions (or the impermeable wall of the
> Unix process structure).
>
> We absolutely want good UTF-8 support at the _edges_ of the system.  We
> want to change GNU troff to cheerfully and correctly interpret UTF-8
> input.  And we want output drivers that target devices using UTF-8 as a
> character encoding to reliably produce it.
>
> But that's all.
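
To make the "UTF-8 at the edges" idea concrete, here is a minimal sketch in Python (not groff's C++, and not groff's actual code). All three function names are hypothetical, and the "formatter core" is a do-nothing stand-in. The only point is that bytes exist at the input and output boundaries, while everything in between handles plain integer code points.

```python
def decode_at_input_edge(raw):
    """Input edge: turn UTF-8 bytes into a list of Unicode code points."""
    return [ord(ch) for ch in raw.decode("utf-8")]


def encode_at_output_edge(code_points):
    """Output edge: turn code points back into UTF-8 bytes."""
    return "".join(chr(cp) for cp in code_points).encode("utf-8")


def format_internally(code_points):
    """Hypothetical stand-in for the formatter core.

    It never sees bytes, only integers, so the encoding used at the
    edges is invisible to it.
    """
    return list(code_points)


# A Russian word: 6 characters, 12 bytes in UTF-8.
raw = "привет".encode("utf-8")
core = format_internally(decode_at_input_edge(raw))
out = encode_at_output_edge(core)
```

Under these assumptions the internal representation could be changed freely (for instance to a 32-bit integer type) without the two edge functions' callers noticing.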
Consider my perspective to be a projection from a known surface to an
unknown core.

> This is the topic of http://savannah.gnu.org/bugs/?40720

Only recently, I started to discover the treasure trove of information
to be unearthed from Savannah (it took me quite a while to grasp its
significance).

[...]

> A rough sketch of my plan is this:
>
> 1.  Ensure that the groff string class is well-encapsulated.
> 2.  Change the internal type, and constructors and output functions
>      only, to perform the transformation on this new type.
> 3.  Verify that nothing broke.  (If I did 1 and 2 correctly, nothing
>      will.)
> 4.  Remap the code points we're squatting on.  Haven't decided yet
>      whether to map them to illegal Unicode code points or to the Unicode
>      Private Use Area.  With a char32_t we have all the room in the
>      world.
> 5.  Drop code page 1047 support, per recent discussions with Mike Fulton
>      of IBM on this list.
> 6.  Start not merely accepting, but _assuming_ UTF-8 input, because we
>      won't misinterpret C1 controls anymore.
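
Steps 4 and 6 of the plan can be illustrated with a small Python sketch. It is a sketch under stated assumptions, not a description of groff internals: I assume purely for illustration that the internally used codes sit in the C1 range U+0080..U+009F, and the mapping into the Private Use Area starting at U+E000 is a hypothetical choice (the plan itself leaves the target undecided). What is certain is the encoding fact behind step 6: UTF-8 continuation bytes fall in 0x80..0xBF, so a reader that treats each byte as one character will see C1 controls inside ordinary Cyrillic text.

```python
C1_FIRST, C1_LAST = 0x80, 0x9F
PUA_BASE = 0xE000  # hypothetical target; the Private Use Area is U+E000..U+F8FF


def c1_controls(code_points):
    """Return the code points that fall in the C1 control range."""
    return [cp for cp in code_points if C1_FIRST <= cp <= C1_LAST]


def remap_internal(code):
    """Hypothetical step 4: move an internal code out of the C1 range."""
    if not C1_FIRST <= code <= C1_LAST:
        raise ValueError("not in the range assumed for internal codes")
    return PUA_BASE + (code - C1_FIRST)


# CYRILLIC SMALL LETTER ER (U+0440) is the two bytes D1 80 in UTF-8.
raw = "\u0440".encode("utf-8")

# Byte-per-character reading: the second byte looks like a C1 control.
as_latin1 = [ord(ch) for ch in raw.decode("latin-1")]

# Proper UTF-8 decoding: one code point, no C1 control in sight.
as_utf8 = [ord(ch) for ch in raw.decode("utf-8")]
```

With the byte-wise reading, `c1_controls(as_latin1)` finds 0x80; with UTF-8 decoding, `c1_controls(as_utf8)` is empty, which is why remapping first makes it safe to assume UTF-8 input afterwards.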

> If that doesn't sound like enough work--at some point in the above, each
> and every preprocessor has to be checked to ensure it isn't screwing up
> the input before it gets to the formatter.

Whenever I can assist with test cases and data files, also for
preprocessors like tbl, please let me know.

> I don't see getting rid of preconv(1) in the near term.  It will remain
> useful, particularly if I add the couple of small features I had in mind
> for it.  It may continue to play a role in getting input into the
> correct Unicode Normalization Form (D).  It might make sense to leave
> that business out of the formatter proper.
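
For readers unfamiliar with Normalization Form D, here is a minimal Python sketch of what such a pre-pass does, using the standard `unicodedata` module. The function name `to_nfd` is my own, and nothing here claims to reflect how preconv(1) is or will be implemented.

```python
import unicodedata


def to_nfd(text):
    """Return text in Unicode Normalization Form D (canonical decomposition)."""
    return unicodedata.normalize("NFD", text)


# CYRILLIC SMALL LETTER SHORT I (U+0439) is a precomposed character.
composed = "\u0439"

# In NFD it becomes CYRILLIC SMALL LETTER I (U+0438) followed by
# COMBINING BREVE (U+0306).
decomposed = to_nfd(composed)
decomposed_code_points = [ord(ch) for ch in decomposed]
```

Normalizing is idempotent, so running the pass on input that is already in NFD leaves it unchanged; that makes it a safe step to keep in a separate filter ahead of the formatter.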

> Regards,
> Branden

Best regards,

Oliver.

--
Dr. Oliver Corff
mailto:oliver.corff@email.de



