Re: [Groff] bullets render as question marks
From: Ingo Schwarze
Subject: Re: [Groff] bullets render as question marks
Date: Tue, 1 Dec 2015 03:01:16 +0100
User-agent: Mutt/1.5.23 (2014-03-12)
Hi Aaron,
Aaron Davies wrote on Mon, Nov 30, 2015 at 08:00:06PM -0500:
> $ locale charmap
> ANSI_X3.4-1968
Heh. I hadn't seen that name for ASCII before and had to look it
up to learn what it means. :)
The GNU nroff(1) script has never heard of that name for ASCII
either, so it falls back to LC_ALL, and since that is "C",
it falls back further to $LESSCHARSET. Is $LESSCHARSET defined
on your system, and if so, what is its value?
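The fallback chain described above can be modelled roughly as follows. This is only a sketch of the order of the checks, not the code of the nroff(1) script: the function name pick_device, the lookup table, and the exact strings matched are assumptions made for illustration.

```python
# Hypothetical model of the fallback order: charmap, then locale name,
# then LESSCHARSET, then plain ASCII. All matched strings are assumptions.
KNOWN_CHARMAPS = {"UTF-8": "utf8", "ISO-8859-1": "latin1"}


def pick_device(charmap: str, locale_name: str, lesscharset: str) -> str:
    """Return an output device name following the described fallback order."""
    # Step 1: a charmap name the script recognizes decides immediately.
    if charmap in KNOWN_CHARMAPS:
        return KNOWN_CHARMAPS[charmap]
    # Step 2: otherwise look at the locale name, e.g. the value of LC_ALL.
    if locale_name.endswith(".UTF-8"):
        return "utf8"
    # Step 3: otherwise consult LESSCHARSET.
    if lesscharset == "utf-8":
        return "utf8"
    # Nothing matched, so use the most conservative device.
    return "ascii"


# An unrecognized charmap plus LC_ALL=C leaves the decision to LESSCHARSET.
print(pick_device("ANSI_X3.4-1968", "C", "utf-8"))
print(pick_device("ANSI_X3.4-1968", "C", ""))
```

Under this model, an unrecognized charmap name combined with LC_ALL=C means the value of $LESSCHARSET alone selects the device, which is why its value matters here.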
> $ echo '\(bu' | groff -Tascii | hexdump -C
> 00000000 2b 08 6f 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a |+.o.............|
[...]
> $ echo '\(bu' | groff -mtty -Tascii | hexdump -C
> 00000000 2b 08 6f 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a |+.o.............|
[...]
Both are correct.
> $ echo '\(bu' | groff -Tutf8 | hexdump -C
> 00000000 c2 b7 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a |................|
[...]
> $ echo '\(bu' | groff -mtty -Tutf8 | hexdump -C
> 00000000 c2 b7 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a |................|
[...]
0xc2 - 0xc0 = 0x02; 0x0200 >> 2 = 0x80
0xb7 - 0x80 = 0x37; 0x37 + 0x80 = 0xb7
So that is U+00B7, MIDDLE DOT.
Perhaps not the best possible choice, but not wrong either, and
certainly valid UTF-8.
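The same decoding can be checked mechanically. The following is a minimal sketch in Python; the helper name decode_two_byte_utf8 is made up for illustration, and it handles only two-byte sequences without rejecting overlong forms.

```python
import unicodedata


def decode_two_byte_utf8(lead: int, cont: int) -> int:
    """Decode a two-byte UTF-8 sequence into a code point."""
    # A two-byte lead byte looks like 110xxxxx, a continuation like 10xxxxxx.
    if lead & 0xE0 != 0xC0 or cont & 0xC0 != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Keep the 5 payload bits of the lead byte and the 6 of the continuation.
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


code_point = decode_two_byte_utf8(0xC2, 0xB7)
name = unicodedata.name(chr(code_point))
print(f"U+{code_point:04X} {name}")  # prints "U+00B7 MIDDLE DOT"
```

Python's own decoder agrees: bytes([0xC2, 0xB7]).decode("utf-8") yields the same single character.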
> i think i can reconstruct the final rendering command now:
>
> /usr/bin/iconv -f utf-8 -t utf-8 $TMPFILE|/usr/bin/groff -mtty-char \
> -P-c -mandoc -Tutf8|/usr/bin/iconv -f utf-8 -t ANSI_X3.4-1968//translit
That is ugly. Someone is trying to sell you hardware? Trying to
burn as many processor cycles as possible, and then some more? ;-)
Apart from the obvious contortion, it doesn't even look right. If the
input really contained non-ASCII characters, the first iconv stage
would be insufficient; the pipeline would need preconv(1) before
groff.
I suggest fixing your system to simply do this instead:
/usr/bin/groff -P-c -mandoc -Tascii
That's all that is needed. Even the -mtty-char is redundant;
your troffrc includes tty.tmac anyway.
> and the first two stages:
>
> $ /usr/bin/iconv -f utf-8 -t utf-8 $TMPFILE | \
> /usr/bin/groff -mtty-char -P-c -mandoc -Tutf8|hexdump -C
[...]
> 000000a0 20 20 20 c2 b7 20 20 20 2d 08 2d 2d 08 2d 62 08 | .. -.--.-b.|
[...]
>
> which renders as what i presume is supposed to be a unicode bullet
> (though it's a bit hard to tell in any of my terminal fonts):
MIDDLE DOT, not BULLET.
> though i can't figure out how 0xc2b7 corresponds to codepoint 2022
It doesn't, it's U+00B7, \(md, see the fallback in tty.tmac.
> the final stage, where iconv is supposed to convert it to ascii,
> is where the bullet turns into a question mark:
Yeah, so yours is an iconv(1) problem, not a groff(1) problem.
In this case, iconv(1) looks like a solution in search of a problem,
and like a broken solution, too. :-/
Just get rid of the pointless iconv(1), and you are fine.
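To see why a question mark can show up at this stage, note that U+00B7 has no ASCII equivalent, so any converter to ASCII has to fail, drop the character, or substitute something. The sketch below uses Python's own codecs as a stand-in; it does not show how iconv(1) works internally, only that "?" is a typical substitute.

```python
# The two bytes groff emitted, decoded to the character U+00B7.
middle_dot = b"\xc2\xb7".decode("utf-8")

# A strict conversion to ASCII cannot succeed for this character.
try:
    middle_dot.encode("ascii")
    has_ascii_form = True
except UnicodeEncodeError:
    has_ascii_form = False

# A lenient conversion substitutes a placeholder instead of failing.
substituted = middle_dot.encode("ascii", errors="replace")

print(has_ascii_form, substituted)  # prints "False b'?'"
```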
> so it looks like the real problem is that the groff stage in the
> middle of the pipeline isn't generating a unicode output that iconv
> understands how to transliterate
No, groff output is valid; I don't know why iconv(1) fails to handle it.
> incidentally, the manpage where i originally discovered this issue
> has a similar problem with \', but as it's using that to represent
> an apostrophe, which is wrong, that's not really your problem
True, that's an acute accent.
> (the odd thing is that groff is rendering that as 0xc2b4, which
> AFAICT *is* correct UTF-8, so maybe my iconv just doesn't have a
> transliteration rule for it?)
Yes, 0xc2b4 = U+00B4 = ACUTE ACCENT, that is fine.
Indeed, you need to debug iconv, not groff.
Yours,
Ingo