Re: [Groff] bullets render as question marks

From:	Aaron Davies
Subject:	Re: [Groff] bullets render as question marks
Date:	Mon, 30 Nov 2015 20:00:06 -0500

On Nov 30, 2015, at 5:40 PM, Ingo Schwarze <address@hidden> wrote:

> Aaron Davies wrote on Mon, Nov 30, 2015 at 03:04:20PM -0500:
> 
>> On Nov 30, 2015, at 12:52 PM, Ingo Schwarze <address@hidden> wrote:
> 
>>> Aaron Davies wrote on Mon, Nov 30, 2015 at 12:38:13PM -0500:
>>> 
>>>> \(bu bullets in man pages are rendering as question marks under
>>>> default settings for me
>>> 
>>> * Which version of groff are you running?
>> 
>> on RHEL 6.7, 1.18.1.4; on RHEL 5.11, 1.18.1.1
> 
> Oh wow.  Those are extremely old versions of groff.
> I haven't seen versions that old in production since 2009.
> 
> Groff 1.19 came out in 2004, so what you are running is more than
> a decade out of date.
> 
>> $ locale
>> LANG=en_US.UTF-8
>> LC_ALL=C
>> LC_CTYPE="C"
> 
> Hmm, i missed that GNU nroff(1) doesn't call setlocale(3), but
> calls locale(1) in a different way, and, failing that, inspects
> the environment directly.  Can you send the output of the following,
> too, to figure out which output device is actually running?
> 
>  $ locale charmap
>  $ env | grep -e LC_ALL -e LC_CTYPE
> 
> I suspect that you are running in -Tascii mode, but i'm not 100%
> sure yet, it could still be -Tutf8.

$ locale charmap
ANSI_X3.4-1968
$ env - locale charmap
ANSI_X3.4-1968
$ env | grep -e LC_ALL -e LC_CTYPE
LC_ALL=C
$ env - env | grep -e LC_ALL -e LC_CTYPE
$ 
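
The outputs above are consistent with the usual POSIX precedence for locale variables: LC_ALL overrides LC_CTYPE, which overrides LANG, and an empty environment falls back to the POSIX locale. A minimal sketch of that rule follows. effective_ctype is a hypothetical helper, not part of groff or the nroff script, and it takes the three values as arguments instead of reading the environment:

```shell
# effective_ctype LC_ALL LC_CTYPE LANG
# Print the locale name that would govern character handling, assuming
# the POSIX precedence LC_ALL > LC_CTYPE > LANG. Empty values are skipped.
# Hypothetical helper for illustration only.
effective_ctype() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"
    else
        printf 'POSIX\n'
    fi
}

# The environment from this thread: LANG asks for UTF-8, but LC_ALL wins.
effective_ctype "C" "" "en_US.UTF-8"    # prints C
```

With LC_ALL=C in effect, a charmap of ANSI_X3.4-1968 (that is, ASCII) is what one would expect even though LANG names a UTF-8 locale.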

>> $ echo '\(bu' | nroff | hexdump -C
>> 00000000  3f 0a 0a 0a 0a 0a 0a 0a  0a 0a 0a 0a 0a 0a 0a 0a  |?...............|
> 
> So, your problem is probably not related to man(1) or to the terminal,
> but groff is actually producing wrong output.
> 
> For -Tascii output, in modern groff, the \(bu character
> is defined by this statement in tmac/tty.tmac:
> 
>  .fchar \[bu] \z+o
> 
> In the source code repository, that line is present since 2002-03-24.
> The file is usually loaded by these lines in troffrc:
> 
>  .do ds troffrc!ascii tty.tmac
>  .do if d troffrc!\*[.T] \
>  .  do mso \*[troffrc!\*[.T]]
> 
> So, can you please look at your file tmac/troffrc (usually located
> in /usr/local/share/groff/<version>/ or a similar place) whether
> it indeed includes tty.tmac, and can you look into your tty.tmac
> (in the same directory) whether and how it defines \[bu]?

looks identical to your expected lines (checked on RHEL 6.7 only):

$ fgrep -e '.do ds troffrc!ascii tty.tmac' -e '.do if d troffrc!\*[.T] \' -e $'.\tdo mso \\*[troffrc!\\*[.T]]' /usr/share/groff/1.18.1.4/tmac/troffrc
.do ds troffrc!ascii tty.tmac
.do if d troffrc!\*[.T] \
.       do mso \*[troffrc!\*[.T]]
$ fgrep '.fchar \[bu]' /usr/share/groff/1.18.1.4/tmac/tty.tmac
.fchar \[bu] \z+o
$ 

> You can also try specifying the device explicitly:
> 
>  $ echo '\(bu' | groff -Tascii | hexdump -C
>  $ echo '\(bu' | groff -mtty -Tascii | hexdump -C
>  $ echo '\(bu' | groff -Tutf8 | hexdump -C
>  $ echo '\(bu' | groff -mtty -Tutf8 | hexdump -C
> 
> Which output do those commands produce?

$ echo '\(bu' | groff -Tascii | hexdump -C
00000000  2b 08 6f 0a 0a 0a 0a 0a  0a 0a 0a 0a 0a 0a 0a 0a  |+.o.............|
00000010  0a 0a 0a 0a 0a 0a 0a 0a  0a 0a 0a 0a 0a 0a 0a 0a  |................|
*
00000040  0a 0a 0a 0a 0a                                    |.....|
00000045
$ echo '\(bu' | groff -mtty -Tascii | hexdump -C
00000000  2b 08 6f 0a 0a 0a 0a 0a  0a 0a 0a 0a 0a 0a 0a 0a  |+.o.............|
00000010  0a 0a 0a 0a 0a 0a 0a 0a  0a 0a 0a 0a 0a 0a 0a 0a  |................|
*
00000040  0a 0a 0a 0a 0a                                    |.....|
00000045
$ echo '\(bu' | groff -Tutf8 | hexdump -C
00000000  c2 b7 0a 0a 0a 0a 0a 0a  0a 0a 0a 0a 0a 0a 0a 0a  |................|
00000010  0a 0a 0a 0a 0a 0a 0a 0a  0a 0a 0a 0a 0a 0a 0a 0a  |................|
*
00000040  0a 0a 0a 0a                                       |....|
00000044
$ echo '\(bu' | groff -mtty -Tutf8 | hexdump -C
00000000  c2 b7 0a 0a 0a 0a 0a 0a  0a 0a 0a 0a 0a 0a 0a 0a  |................|
00000010  0a 0a 0a 0a 0a 0a 0a 0a  0a 0a 0a 0a 0a 0a 0a 0a  |................|
*
00000040  0a 0a 0a 0a                                       |....|
00000044
$ 
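
For reading the -Tascii dumps: the bytes 2b 08 6f are "+", backspace, "o", i.e. the two characters from the \z+o definition overstruck in one cell. Bold text in tty output uses the same character-backspace-character mechanism. A minimal sketch of a filter that drops the overstruck character so the result reads as plain text; strip_overstrike is a hypothetical name, and the filter assumes simple one-character overstrikes only:

```shell
# strip_overstrike: remove every "character, backspace" pair from stdin,
# keeping only the character printed last in each cell.
# Hypothetical helper for illustration only.
strip_overstrike() {
    bs=$(printf '\b')          # a literal backspace (0x08)
    sed "s/.${bs}//g"
}

printf '+\bo\n' | strip_overstrike                # prints o
printf 'N\bNA\bAM\bME\bE\n' | strip_overstrike    # prints NAME
```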

> For UTF-8 tty output, in modern groff, \(bu is defined by the
> following line in src/libs/libgroff/glyphuni.cpp:
> 
>  { "bu", "2022" },
> 
> That line is unchanged in the source code repo since 2002-11-03.
> So the UTF-8 modes should work in any case, even without -mtty,
> even with your ancient version of groff.
> 
>> on RHEL 6.7:
>> 
>> $ grep ^NROFF /etc/man.config
>> NROFF                /usr/bin/nroff -c -mandoc 2>/dev/null
> 
> That looks reasonable.
> 
>> on RHEL 5.11:
>> 
>> $ grep ^NROFF /etc/man.config
>> NROFF                /usr/bin/nroff -c --legacy NROFF_OLD_CHARSET -mandoc 2>/dev/null
> 
> I don't know what "--legacy NROFF_OLD_CHARSET" is supposed to mean,
> but since you see the problem on RHEL 6.7 as well, it doesn't seem
> likely to be the cause of the problem.

it looks like it might be passed to iconv under certain circumstances, but some 
fiddling around with a copy of the nroff script suggests that isn't actually 
happening, so yes, it looks like that's not relevant

i think i can reconstruct the final rendering command now:

/usr/bin/iconv -f utf-8 -t utf-8 $TMPFILE|/usr/bin/groff -mtty-char -P-c -mandoc -Tutf8|/usr/bin/iconv -f utf-8 -t ANSI_X3.4-1968//translit

here's the output after the first stage of the pipeline:

$ /usr/bin/iconv -f utf-8 -t utf-8 $TMPFILE
.if !\n(.g .ab GNU tbl requires GNU troff.
.if !dTS .ds TS
.if !dTE .ds TE
.lf 1 -
.pl 11i
.
.TH "foo" "1" "December 2015" "" ""
.
.SH "NAME"
\fBfoo\fR
.
.SH "SYNOPSIS"
\fBfoo\fR
.
.IP "\(bu" 4
\fB\-\-baz\fR:
.
.IP
Option baz\.
.
$ 

and the first two stages:

$ /usr/bin/iconv -f utf-8 -t utf-8 $TMPFILE|/usr/bin/groff -mtty-char -P-c -mandoc -Tutf8|hexdump -C
00000000  66 6f 6f 28 31 29 20 20  20 20 20 20 20 20 20 20  |foo(1)          |
00000010  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
*
00000040  20 20 20 20 20 20 20 20  66 6f 6f 28 31 29 0a 0a  |        foo(1)..|
00000050  0a 0a 4e 08 4e 41 08 41  4d 08 4d 45 08 45 0a 20  |..N.NA.AM.ME.E. |
00000060  20 20 20 20 20 20 66 08  66 6f 08 6f 6f 08 6f 0a  |      f.fo.oo.o.|
00000070  0a 53 08 53 59 08 59 4e  08 4e 4f 08 4f 50 08 50  |.S.SY.YN.NO.OP.P|
00000080  53 08 53 49 08 49 53 08  53 0a 20 20 20 20 20 20  |S.SI.IS.S.      |
00000090  20 66 08 66 6f 08 6f 6f  08 6f 0a 0a 20 20 20 20  | f.fo.oo.o..    |
000000a0  20 20 20 c2 b7 20 20 20  2d 08 2d 2d 08 2d 62 08  |   ..   -.--.-b.|
000000b0  62 61 08 61 7a 08 7a 3a  0a 0a 20 20 20 20 20 20  |ba.az.z:..      |
000000c0  20 20 20 20 20 4f 70 74  69 6f 6e 20 62 61 7a 2e  |     Option baz.|
000000d0  0a 0a 0a 0a 20 20 20 20  20 20 20 20 20 20 20 20  |....            |
000000e0  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
000000f0  20 20 20 20 20 44 65 63  65 6d 62 65 72 20 32 30  |     December 20|
00000100  31 35 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |15              |
00000110  20 20 20 20 20 20 20 20  20 20 20 20 66 6f 6f 28  |            foo(|
00000120  31 29 0a                                          |1).|
00000123
$ 

which renders as what i presume is supposed to be a unicode bullet (though it's 
a bit hard to tell in any of my terminal fonts):

$ /usr/bin/iconv -f utf-8 -t utf-8 $TMPFILE|/usr/bin/groff -mtty-char -P-c -mandoc -Tutf8
foo(1)                                                                  foo(1)



NAME
       foo

SYNOPSIS
       foo

       ·   --baz:

           Option baz.



                                 December 2015                          foo(1)
$ 

though i can't figure out how 0xc2b7 corresponds to codepoint 2022
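
Decoding the bytes by hand shows why the two don't line up: c2 b7 is a two-byte UTF-8 sequence, and its payload bits give U+00B7 (MIDDLE DOT), not U+2022. A minimal sketch of the arithmetic in shell; utf8_to_codepoint is a hypothetical helper that handles sequences of up to three bytes and does no validation:

```shell
# utf8_to_codepoint BYTE [BYTE [BYTE]]
# Decode one UTF-8 sequence given as hex bytes and print it as U+XXXX.
# Lead byte: 110xxxxx (2 bytes) or 1110xxxx (3 bytes);
# continuation bytes: 10xxxxxx. Hypothetical helper, no validation.
utf8_to_codepoint() {
    case $# in
        1) cp=$(( 0x$1 )) ;;
        2) cp=$(( ((0x$1 & 0x1F) << 6) | (0x$2 & 0x3F) )) ;;
        3) cp=$(( ((0x$1 & 0x0F) << 12) | ((0x$2 & 0x3F) << 6) | (0x$3 & 0x3F) )) ;;
        *) echo "usage: utf8_to_codepoint BYTE [BYTE [BYTE]]" >&2; return 1 ;;
    esac
    printf 'U+%04X\n' "$cp"
}

utf8_to_codepoint c2 b7       # prints U+00B7
utf8_to_codepoint e2 80 a2    # prints U+2022
```

So the groff stage here is emitting U+00B7, a different character from the 2022 entry quoted above.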

the final stage, where iconv is supposed to convert it to ascii, is where the 
bullet turns into a question mark:

$ /usr/bin/iconv -f utf-8 -t utf-8 $TMPFILE|/usr/bin/groff -mtty-char -P-c -mandoc -Tutf8|/usr/bin/iconv -f utf-8 -t ANSI_X3.4-1968//translit|hexdump -C
00000000  66 6f 6f 28 31 29 20 20  20 20 20 20 20 20 20 20  |foo(1)          |
00000010  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
*
00000040  20 20 20 20 20 20 20 20  66 6f 6f 28 31 29 0a 0a  |        foo(1)..|
00000050  0a 0a 4e 08 4e 41 08 41  4d 08 4d 45 08 45 0a 20  |..N.NA.AM.ME.E. |
00000060  20 20 20 20 20 20 66 08  66 6f 08 6f 6f 08 6f 0a  |      f.fo.oo.o.|
00000070  0a 53 08 53 59 08 59 4e  08 4e 4f 08 4f 50 08 50  |.S.SY.YN.NO.OP.P|
00000080  53 08 53 49 08 49 53 08  53 0a 20 20 20 20 20 20  |S.SI.IS.S.      |
00000090  20 66 08 66 6f 08 6f 6f  08 6f 0a 0a 20 20 20 20  | f.fo.oo.o..    |
000000a0  20 20 20 3f 20 20 20 2d  08 2d 2d 08 2d 62 08 62  |   ?   -.--.-b.b|
000000b0  61 08 61 7a 08 7a 3a 0a  0a 20 20 20 20 20 20 20  |a.az.z:..       |
000000c0  20 20 20 20 4f 70 74 69  6f 6e 20 62 61 7a 2e 0a  |    Option baz..|
000000d0  0a 0a 0a 20 20 20 20 20  20 20 20 20 20 20 20 20  |...             |
000000e0  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
000000f0  20 20 20 20 44 65 63 65  6d 62 65 72 20 32 30 31  |    December 201|
00000100  35 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |5               |
00000110  20 20 20 20 20 20 20 20  20 20 20 66 6f 6f 28 31  |           foo(1|
00000120  29 0a                                             |).|
00000122
$ 

interestingly, if i output the standard UTF-8 representation of 2022 to the 
terminal, i get a different bullet:

$ echo $'\xe2\x80\xa2'
•
$ 

and if i substitute that in for the 0xc2b7 sequence, i get a reasonable ascii 
transliteration out of the final iconv stage:

$ /usr/bin/iconv -f utf-8 -t utf-8 $TMPFILE|/usr/bin/groff -mtty-char -P-c -mandoc -Tutf8|sed -r $'s/\xc2\xb7/\xe2\x80\xa2/'|/usr/bin/iconv -f utf-8 -t ANSI_X3.4-1968//translit
foo(1)                                                                  foo(1)



NAME
       foo

SYNOPSIS
       foo

       o   --baz:

           Option baz.



                                 December 2015                          foo(1)
$ 

so it looks like the real problem is that the groff stage in the middle of the 
pipeline isn't generating a unicode output that iconv understands how to 
transliterate
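
One way to check this without groff in the loop is to feed each candidate character straight into the final iconv stage and look at the bytes that come out. A minimal sketch, assuming an iconv that accepts the ANSI_X3.4-1968 charset name and the //TRANSLIT suffix as in the pipeline above; to_ascii is a hypothetical wrapper name, and the result for each character depends on the transliteration tables of the installed iconv, so no output is shown:

```shell
# to_ascii: the last stage of the rendering pipeline, as a stdin filter.
# Hypothetical wrapper for illustration only.
to_ascii() {
    iconv -f utf-8 -t 'ANSI_X3.4-1968//TRANSLIT'
}

# Octal escapes are used because printf '\x..' is not portable.
printf '\302\267\n'     | to_ascii | od -An -tx1    # U+00B7 MIDDLE DOT
printf '\342\200\242\n' | to_ascii | od -An -tx1    # U+2022 BULLET
```

A 3f in the output means iconv had no transliteration rule for that character and substituted a question mark.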

incidentally, the manpage where i originally discovered this issue has a 
similar problem with \', but as it's using that to represent an apostrophe, 
which is wrong, that's not really your problem

(the odd thing is that groff is rendering that as 0xc2b4, which AFAICT *is* 
correct UTF-8, so maybe my iconv just doesn't have a transliteration rule for 
it?)
-- 
Aaron Davies
address@hidden
