bug-groff
Re: The hyphenation algorithm produces wrong results


From: Werner LEMBERG
Subject: Re: The hyphenation algorithm produces wrong results
Date: Sun, 04 Mar 2018 08:39:44 +0100 (CET)

> .ll 1n
> .hy 48

You *must not* use such values if the patterns don't allow it!  From
groff.texi:

  For historical reasons the default value of the @code{hy} request
  doesn't fit the American English hyphenation patterns that are used
  by groff as the default.  These patterns expect that neither the
  first character nor the last two characters are to be hyphenated;
  this corresponds to value@tie{}4.  Consequently, @code{hy}'s default
  value@tie{}1 or even setting values 16 or@tie{}32 might lead to
  (additional) incorrect hyphenation points.

Anyway, I've now replaced this with

  The number of characters at the beginning of a word after which the
  first hyphenation point should be inserted is determined by the
  patterns themselves; it can't be reduced further without introducing
  additional, invalid hyphenation points (unfortunately, this
  information is not part of a pattern file, you have to know it in
  advance).  The same is true for the number of characters at the end
  of a word before which the last hyphenation point should be inserted.  For
  example, the code

  @Example
  .ll 1
  .hy 48
  splitting
  @endExample

  returns

  @Example
  s-
  plit-
  t-
  in-
  g
  @endExample

  instead of the correct `split-ting'.  US-English patterns as
  distributed with groff need two characters at the beginning and
  three characters at the end; this means that value@tie{}4 of
  @code{hy} is mandatory.  Value@tie{}8 is possible as an additional
  restriction, but values@tie{}1 (the default!), 16, and@tie{}32
  should be avoided.

to clarify the issue even more.

> The algorithm
> 
> 1) uses pattern in the wrong places, at the beginning of a word
>    although no period is in the pattern

Your view of how patterns work is too simplistic...

> 2) splits off one letter at the end although I found no corresponding
> pattern in the "hyphen.us" file.
>
>   splitting   s-plit-t-in-g

OK, let's look at the word `splitting', using the `patternize.lua'
demo program from the padrinoma project
(https://github.com/sh2d/padrinoma).

  > texlua patternize.lua -p hyphen.us -l 1 -t 1 -m 1 -v
  pattern file: hyphen.us (4555 patterns read)
  spot mins, special characters: 1 1 '-=.'

  splitting

   . s p l i t t i n g .
      1p2l2
         l1i t
            4t3t2
          4i t t
              2t1i n
   .0s1p2l4i4t3t2i0n0g0.
  s-plit-ting

As can be seen, the patterns themselves contain a breakpoint after the
leading `s' character!
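
For illustration, here is a minimal Python sketch of how the inter-letter
values in the trace combine.  It is a simplified model, not groff's or
TeX's actual code: it uses only the five patterns visible in the trace
(the real hyphen.us has thousands), and the names `parse_pattern' and
`hyphenate' as well as the `left_min'/`right_min' parameters are made up
for this example.

```python
import re

# Only the patterns that matched in the trace above (an assumption:
# the full pattern file may contribute further values for other words).
PATTERNS = ["1p2l2", "l1it", "4t3t2", "4itt", "2t1in"]


def parse_pattern(pat):
    """Split e.g. '4t3t2' into its letters 'tt' and values [4, 3, 2]."""
    letters = re.sub(r"\d", "", pat)
    values = [0] * (len(letters) + 1)
    pos = 0
    for ch in pat:
        if ch.isdigit():
            values[pos] = int(ch)   # value sits before letter number `pos`
        else:
            pos += 1
    return letters, values


def hyphenate(word, patterns, left_min=1, right_min=1):
    """Insert '-' at every position whose maximum value is odd."""
    padded = "." + word + "."            # '.' marks the word boundaries
    scores = [0] * (len(padded) + 1)     # scores[i] sits before padded[i]
    for pat in patterns:
        letters, values = parse_pattern(pat)
        start = padded.find(letters)
        while start != -1:               # apply at every occurrence
            for k, v in enumerate(values):
                scores[start + k] = max(scores[start + k], v)
            start = padded.find(letters, start + 1)
    pieces, last = [], 0
    # A break before word[i] corresponds to scores[i + 1].
    for i in range(left_min, len(word) - right_min + 1):
        if scores[i + 1] % 2 == 1:
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return "-".join(pieces)


print(hyphenate("splitting", PATTERNS))          # minima 1 and 1
print(hyphenate("splitting", PATTERNS, 2, 3))    # minima 2 and 3
```

With minima of 1 and 1 the sketch reproduces `s-plit-ting'; with two
characters at the beginning and three at the end, the odd value after
the leading `s' is simply never consulted.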

However, your extreme line length settings make groff emit `can't
break line' warnings.  When that happens, groff apparently starts
searching anew for hyphenation points in the remaining substring.

   plitting

   . p l i t t i n g .
    1p2l2
       l1i t
          4t3t2
        4i t t
            2t1i n
   .0p2l4i4t3t2i0n0g0.
  plit-ting

  ting

   . t i n g .
   . t i2
    2t1i n
   . t i n g4
   .0t1i2n0g0.
  t-ing

  ing

   . i n g .
   . i n1
   .0i0n1g0.
  in-g

I'm not sure whether I should classify groff's behaviour of restarting
the hyphenation process as a feature or a bug (I tend towards the latter).
However, I don't have time to work on that.
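
The restarting behaviour can be modelled with a short Python sketch.
This is only my reading of what groff apparently does, not its actual
implementation: the pattern list is limited to the patterns visible in
the four traces, and `break_points' and `restart_hyphenation' are
hypothetical helper names.

```python
import re

# Patterns read off the four traces; the real hyphen.us has many more.
PATTERNS = ["1p2l2", "l1it", "4t3t2", "4itt", "2t1in",
            ".ti2", ".ting4", ".in1"]


def break_points(word, patterns, left_min=1, right_min=1):
    """Return the indices i where a break before word[i] is allowed."""
    padded = "." + word + "."
    scores = [0] * (len(padded) + 1)     # scores[i] sits before padded[i]
    for pat in patterns:
        letters = re.sub(r"\d", "", pat)
        values, pos = [0] * (len(letters) + 1), 0
        for ch in pat:
            if ch.isdigit():
                values[pos] = int(ch)
            else:
                pos += 1
        start = padded.find(letters)
        while start != -1:
            for k, v in enumerate(values):
                scores[start + k] = max(scores[start + k], v)
            start = padded.find(letters, start + 1)
    return [i for i in range(left_min, len(word) - right_min + 1)
            if scores[i + 1] % 2 == 1]


def restart_hyphenation(word, patterns, left_min=1, right_min=1):
    """Cut at the first break point, then start over on the remainder."""
    pieces, rest = [], word
    while True:
        points = break_points(rest, patterns, left_min, right_min)
        if not points:
            pieces.append(rest)
            return pieces
        pieces.append(rest[:points[0]])
        rest = rest[points[0]:]          # the remainder is treated as a new word


print(restart_hyphenation("splitting", PATTERNS))
print(restart_hyphenation("splitting", PATTERNS, 2, 3))
```

Because each remainder is handled as if it were a complete word, the
word-start patterns (those beginning with `.') apply to `ting' and
`ing', which is where the extra breaks come from.  With minima of 2
and 3 the model yields only `split' and `ting'.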

> The cases '16' and '32' (for .hy) may not add hyphenation points,
> just allow already found ones, if otherwise forbidden.

Nice idea, but impossible to implement without meta-knowledge.  As
mentioned above, the hyphenation patterns are constructed with certain
\lefthyphenmin and \righthyphenmin values.  However, those values are
*not* present in the hyphenation patterns – you have to know them (I
consider this a design bug in TeX).  In other words, only the user
knows whether values 16 or 32 are valid for a given language's
hyphenation patterns.

> [...]  So the algorithm has to be fixed ...

Definitely not. 

> ... and tested with ".hy 1" (the current stable version) and with
> ".hy 48" (development) to see if it works correctly according to the
> used hyphenation pattern file.

The algorithm works as expected, there is nothing to fix.  Barring
still hidden bugs, the problem *is* fixed.  It probably doesn't meet
your expectations, though :-)


    Werner
