Re: [Groff] bullets render as question marks
From: Aaron Davies
Subject: Re: [Groff] bullets render as question marks
Date: Mon, 30 Nov 2015 20:00:06 -0500
On Nov 30, 2015, at 5:40 PM, Ingo Schwarze <address@hidden> wrote:
> Aaron Davies wrote on Mon, Nov 30, 2015 at 03:04:20PM -0500:
>
>> On Nov 30, 2015, at 12:52 PM, Ingo Schwarze <address@hidden> wrote:
>
>>> Aaron Davies wrote on Mon, Nov 30, 2015 at 12:38:13PM -0500:
>>>
>>>> \(bu bullets in man pages are rendering as question marks under
>>>> default settings for me
>>>
>>> * Which version of groff are you running?
>>
>> on RHEL 6.7, 1.18.1.4; on RHEL 5.11, 1.18.1.1
>
> Oh wow. Those are extremely old versions of groff.
> I haven't seen versions that old in production since 2009.
>
> Groff 1.19 came out in 2004, so what you are running is more than
> a decade out of date.
>
>> $ locale
>> LANG=en_US.UTF-8
>> LC_ALL=C
>> LC_CTYPE="C"
>
> Hmm, i missed that GNU nroff(1) doesn't call setlocale(3), but
> calls locale(1) in a different way, and, failing that, inspects
> the environment directly. Can you send the output of the following,
> too, to figure out which output device is actually running?
>
> $ locale charmap
> $ env | grep -e LC_ALL -e LC_CTYPE
>
> I suspect that you are running in -Tascii mode, but i'm not 100%
> sure yet, it could still be -Tutf8.
$ locale charmap
ANSI_X3.4-1968
$ env - locale charmap
ANSI_X3.4-1968
$ env | grep -e LC_ALL -e LC_CTYPE
LC_ALL=C
$ env - env | grep -e LC_ALL -e LC_CTYPE
$
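for reference, here is a minimal sketch of the kind of charmap-to-device mapping described above. it is a simplification under assumptions, not the actual nroff script: the function name pick_device is hypothetical, the list of charmaps is abbreviated, and a distribution-patched nroff (such as the RHEL one) may well behave differently.

```shell
#!/bin/sh
# Hypothetical helper: map a charmap name (as printed by `locale charmap`)
# to a groff terminal output device. Simplified; not the real nroff logic.
pick_device() {
    case "$1" in
        UTF-8)                    echo utf8 ;;
        ISO-8859-1 | ISO-8859-15) echo latin1 ;;
        *)                        echo ascii ;;  # e.g. ANSI_X3.4-1968 under LC_ALL=C
    esac
}
```

under this sketch, `pick_device "$(locale charmap)"` would pick ascii for the ANSI_X3.4-1968 charmap shown above.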
>> $ echo '\(bu' | nroff | hexdump -C
>> 00000000 3f 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a |?...............|
>
> So, your problem is probably not related to man(1) or to the terminal,
> but groff is actually producing wrong output.
>
> For -Tascii output, in modern groff, the \(bu character
> is defined by this statement in tmac/tty.tmac:
>
> .fchar \[bu] \z+o
>
> In the source code repository, that line is present since 2002-03-24.
> The file is usually loaded by these lines in troffrc:
>
> .do ds troffrc!ascii tty.tmac
> .do if d troffrc!\*[.T] \
> . do mso \*[troffrc!\*[.T]]
>
> So, can you please check whether your file tmac/troffrc (usually
> located in /usr/local/share/groff/<version>/ or a similar place)
> indeed includes tty.tmac, and check in your tty.tmac
> (in the same directory) whether and how it defines \[bu]?
looks identical to your expected lines (checked on RHEL 6.7 only):
$ fgrep -e '.do ds troffrc!ascii tty.tmac' -e '.do if d troffrc!\*[.T] \' -e $'.\tdo mso \\*[troffrc!\\*[.T]]' /usr/share/groff/1.18.1.4/tmac/troffrc
.do ds troffrc!ascii tty.tmac
.do if d troffrc!\*[.T] \
. do mso \*[troffrc!\*[.T]]
$ fgrep '.fchar \[bu]' /usr/share/groff/1.18.1.4/tmac/tty.tmac
.fchar \[bu] \z+o
$
> You can also try specifying the device explicitly:
>
> $ echo '\(bu' | groff -Tascii | hexdump -C
> $ echo '\(bu' | groff -mtty -Tascii | hexdump -C
> $ echo '\(bu' | groff -Tutf8 | hexdump -C
> $ echo '\(bu' | groff -mtty -Tutf8 | hexdump -C
>
> Which output do those commands produce?
$ echo '\(bu' | groff -Tascii | hexdump -C
00000000 2b 08 6f 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a |+.o.............|
00000010 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a |................|
*
00000040 0a 0a 0a 0a 0a |.....|
00000045
$ echo '\(bu' | groff -mtty -Tascii | hexdump -C
00000000 2b 08 6f 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a |+.o.............|
00000010 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a |................|
*
00000040 0a 0a 0a 0a 0a |.....|
00000045
$ echo '\(bu' | groff -Tutf8 | hexdump -C
00000000 c2 b7 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a |................|
00000010 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a |................|
*
00000040 0a 0a 0a 0a |....|
00000044
$ echo '\(bu' | groff -mtty -Tutf8 | hexdump -C
00000000 c2 b7 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a |................|
00000010 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a |................|
*
00000040 0a 0a 0a 0a |....|
00000044
$
> For UTF-8 tty output, in modern groff, \(bu is defined by the
> following line in src/libs/libgroff/glyphuni.cpp:
>
> { "bu", "2022" },
>
> That line is unchanged in the source code repo since 2002-11-03.
> So the UTF-8 modes should work in any case, even without -mtty,
> even with your ancient version of groff.
>
>> on RHEL 6.7:
>>
>> $ grep ^NROFF /etc/man.config
>> NROFF /usr/bin/nroff -c -mandoc 2>/dev/null
>
> That looks reasonable.
>
>> on RHEL 5.11:
>>
>> $ grep ^NROFF /etc/man.config
>> NROFF /usr/bin/nroff -c --legacy NROFF_OLD_CHARSET -mandoc 2>/dev/null
>
> I don't know what "--legacy NROFF_OLD_CHARSET" is supposed to mean,
> but since you see the problem on RHEL 6.7 as well, it doesn't seem
> likely to be the cause of the problem.
it looks like it might be passed to iconv under certain circumstances, but some
fiddling around with a copy of the nroff script suggests that isn't actually
happening, so yes, it looks like that's not relevant
i think i can reconstruct the final rendering command now:
/usr/bin/iconv -f utf-8 -t utf-8 $TMPFILE |
  /usr/bin/groff -mtty-char -P-c -mandoc -Tutf8 |
  /usr/bin/iconv -f utf-8 -t ANSI_X3.4-1968//translit
here's the output after the first stage of the pipeline:
$ /usr/bin/iconv -f utf-8 -t utf-8 $TMPFILE
.if !\n(.g .ab GNU tbl requires GNU troff.
.if !dTS .ds TS
.if !dTE .ds TE
.lf 1 -
.pl 11i
.
.TH "foo" "1" "December 2015" "" ""
.
.SH "NAME"
\fBfoo\fR
.
.SH "SYNOPSIS"
\fBfoo\fR
.
.IP "\(bu" 4
\fB\-\-baz\fR:
.
.IP
Option baz\.
.
$
and the first two stages:
$ /usr/bin/iconv -f utf-8 -t utf-8 $TMPFILE |
    /usr/bin/groff -mtty-char -P-c -mandoc -Tutf8 |
    hexdump -C
00000000 66 6f 6f 28 31 29 20 20 20 20 20 20 20 20 20 20 |foo(1) |
00000010 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | |
*
00000040 20 20 20 20 20 20 20 20 66 6f 6f 28 31 29 0a 0a | foo(1)..|
00000050 0a 0a 4e 08 4e 41 08 41 4d 08 4d 45 08 45 0a 20 |..N.NA.AM.ME.E. |
00000060 20 20 20 20 20 20 66 08 66 6f 08 6f 6f 08 6f 0a | f.fo.oo.o.|
00000070 0a 53 08 53 59 08 59 4e 08 4e 4f 08 4f 50 08 50 |.S.SY.YN.NO.OP.P|
00000080 53 08 53 49 08 49 53 08 53 0a 20 20 20 20 20 20 |S.SI.IS.S. |
00000090 20 66 08 66 6f 08 6f 6f 08 6f 0a 0a 20 20 20 20 | f.fo.oo.o.. |
000000a0 20 20 20 c2 b7 20 20 20 2d 08 2d 2d 08 2d 62 08 | .. -.--.-b.|
000000b0 62 61 08 61 7a 08 7a 3a 0a 0a 20 20 20 20 20 20 |ba.az.z:.. |
000000c0 20 20 20 20 20 4f 70 74 69 6f 6e 20 62 61 7a 2e | Option baz.|
000000d0 0a 0a 0a 0a 20 20 20 20 20 20 20 20 20 20 20 20 |.... |
000000e0 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | |
000000f0 20 20 20 20 20 44 65 63 65 6d 62 65 72 20 32 30 | December 20|
00000100 31 35 20 20 20 20 20 20 20 20 20 20 20 20 20 20 |15 |
00000110 20 20 20 20 20 20 20 20 20 20 20 20 66 6f 6f 28 | foo(|
00000120 31 29 0a |1).|
00000123
$
which renders as what i presume is supposed to be a unicode bullet (though it's
a bit hard to tell in any of my terminal fonts):
$ /usr/bin/iconv -f utf-8 -t utf-8 $TMPFILE |
    /usr/bin/groff -mtty-char -P-c -mandoc -Tutf8
foo(1) foo(1)
NAME
foo
SYNOPSIS
foo
· --baz:
Option baz.
December 2015 foo(1)
$
though i can't figure out how 0xc2b7 corresponds to codepoint 2022
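the correspondence can be checked by hand. a minimal sketch in plain shell arithmetic (the function names decode2 and encode3 are made up for illustration) suggests that c2 b7 is the UTF-8 encoding of U+00B7, the middle dot, whereas U+2022 would encode as e2 80 a2:

```shell
#!/bin/sh
# Decode a two-byte UTF-8 sequence (110xxxxx 10yyyyyy) into its code point.
# Arguments: the two bytes as hex digits, e.g. `decode2 c2 b7`.
decode2() {
    printf 'U+%04X\n' $(( ((0x$1 & 0x1f) << 6) | (0x$2 & 0x3f) ))
}

# Encode a code point in the range U+0800..U+FFFF as three UTF-8 bytes
# (1110xxxx 10yyyyyy 10zzzzzz). Argument: the code point as hex digits.
encode3() {
    cp=$(( 0x$1 ))
    printf '%02x %02x %02x\n' \
        $(( 0xe0 | (cp >> 12) )) \
        $(( 0x80 | ((cp >> 6) & 0x3f) )) \
        $(( 0x80 | (cp & 0x3f) ))
}

decode2 c2 b7    # prints U+00B7 (MIDDLE DOT), not U+2022
encode3 2022     # prints e2 80 a2 (BULLET)
```

so, at least by this arithmetic, groff here is emitting a middle dot rather than the bullet at codepoint 2022.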
the final stage, where iconv is supposed to convert it to ascii, is where the
bullet turns into a question mark:
$ /usr/bin/iconv -f utf-8 -t utf-8 $TMPFILE |
    /usr/bin/groff -mtty-char -P-c -mandoc -Tutf8 |
    /usr/bin/iconv -f utf-8 -t ANSI_X3.4-1968//translit |
    hexdump -C
00000000 66 6f 6f 28 31 29 20 20 20 20 20 20 20 20 20 20 |foo(1) |
00000010 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | |
*
00000040 20 20 20 20 20 20 20 20 66 6f 6f 28 31 29 0a 0a | foo(1)..|
00000050 0a 0a 4e 08 4e 41 08 41 4d 08 4d 45 08 45 0a 20 |..N.NA.AM.ME.E. |
00000060 20 20 20 20 20 20 66 08 66 6f 08 6f 6f 08 6f 0a | f.fo.oo.o.|
00000070 0a 53 08 53 59 08 59 4e 08 4e 4f 08 4f 50 08 50 |.S.SY.YN.NO.OP.P|
00000080 53 08 53 49 08 49 53 08 53 0a 20 20 20 20 20 20 |S.SI.IS.S. |
00000090 20 66 08 66 6f 08 6f 6f 08 6f 0a 0a 20 20 20 20 | f.fo.oo.o.. |
000000a0 20 20 20 3f 20 20 20 2d 08 2d 2d 08 2d 62 08 62 | ? -.--.-b.b|
000000b0 61 08 61 7a 08 7a 3a 0a 0a 20 20 20 20 20 20 20 |a.az.z:.. |
000000c0 20 20 20 20 4f 70 74 69 6f 6e 20 62 61 7a 2e 0a | Option baz..|
000000d0 0a 0a 0a 20 20 20 20 20 20 20 20 20 20 20 20 20 |... |
000000e0 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | |
000000f0 20 20 20 20 44 65 63 65 6d 62 65 72 20 32 30 31 | December 201|
00000100 35 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 |5 |
00000110 20 20 20 20 20 20 20 20 20 20 20 66 6f 6f 28 31 | foo(1|
00000120 29 0a |).|
00000122
$
interestingly, if i output the standard UTF-8 representation of 2022 to the
terminal, i get a different bullet:
$ echo $'\xe2\x80\xa2'
•
$
and if i substitute that in for the 0xc2b7 sequence, i get a reasonable ascii
transliteration out of the final iconv stage:
$ /usr/bin/iconv -f utf-8 -t utf-8 $TMPFILE |
    /usr/bin/groff -mtty-char -P-c -mandoc -Tutf8 |
    sed -r $'s/\xc2\xb7/\xe2\x80\xa2/' |
    /usr/bin/iconv -f utf-8 -t ANSI_X3.4-1968//translit
foo(1) foo(1)
NAME
foo
SYNOPSIS
foo
o --baz:
Option baz.
December 2015 foo(1)
$
so it looks like the real problem is that the groff stage in the middle of the
pipeline isn't generating a unicode output that iconv understands how to
transliterate
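the substitution above could be packaged as a small filter. this is only a sketch of a workaround, not a fix for groff or iconv: the function name fix_bullets is hypothetical, it replaces every c2 b7 sequence on a line (the sed command above replaces only the first), and it uses octal printf escapes so that it does not depend on bash's $'...' quoting.

```shell
#!/bin/sh
# Hypothetical workaround filter: rewrite U+00B7 (bytes c2 b7) as
# U+2022 (bytes e2 80 a2) so that a later `iconv ... //translit` stage
# receives a bullet it may know how to transliterate.
fix_bullets() {
    mid=$(printf '\302\267')       # c2 b7 written as octal escapes
    bul=$(printf '\342\200\242')   # e2 80 a2 written as octal escapes
    # LC_ALL=C makes sed treat its input as plain bytes.
    LC_ALL=C sed "s/$mid/$bul/g"
}
```

in the pipeline above it would sit between the groff stage and the final iconv stage. whether iconv then emits an "o" presumably depends on its transliteration tables.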
incidentally, the manpage where i originally discovered this issue has a
similar problem with \', but as it's using that to represent an apostrophe,
which is wrong, that's not really your problem
(the odd thing is that groff is rendering that as 0xc2b4, which AFAICT *is*
correct UTF-8, so maybe my iconv just doesn't have a transliteration rule for
it?)
--
Aaron Davies
address@hidden