Re: new command @U{nnnn}?


From: Patrice Dumas
Subject: Re: new command @U{nnnn}?
Date: Sat, 22 Nov 2014 17:28:22 +0100
User-agent: Mutt/1.5.20 (2009-12-10)

On Tue, Nov 04, 2014 at 05:37:50PM +0000, Karl Berry wrote:
> Patrice and all,
> 
> Anyway, so HTML, XML, and Docbook should be easy (just output &#x, and
> if it doesn't work, not our problem).

It is not clear that this is not our problem, as we produce output in an
encoding based on the @documentencoding, and one could consider that it is
our problem to make sure that all the entities we output are consistent
with this encoding.  But I also agree that "Going to great lengths to
analyze whether the NNNN is part of the @documentencoding does not seem
warranted".

> The question is Info and plain text (when NNNN > 7f).  I'm not sure if
> the Perl libraries you're already using give us anything useful in this
> regard.  Going to great lengths to analyze whether the NNNN is part of
> the @documentencoding does not seem warranted.  Creating a new command
> to separate the (input) @documentencoding from the output encoding (and
> then forcing the output to UTF-8), though perhaps useful for other
> reasons, seems like a tremendous effort, essentially reimplementing
> iconv, etc.  I conjecture that it would be ok to output UTF-8 if no
> @documentencoding is given, and otherwise just output some string, e.g.,
> the literal ascii characters "U+NNNN".  Plenty of other behaviors are
> possible too, of course.

Actually, the input encoding and output encoding are already separated.
There are quite a few customization variables related to that, like 
INPUT_ENCODING_NAME
INPUT_PERL_ENCODING
OUTPUT_ENCODING_NAME
and the undocumented (at least in texinfo.texi) OUTPUT_PERL_ENCODING.

The basic idea is that perl only operates on unicode.  All input is
converted upon reading and handled as unicode, and the output is
converted to the output encoding upon writing.  If perl cannot decode
the texi file using the input encoding, or cannot encode the result in
the output encoding, it gives a cryptic warning (there is a thread/bug on
that in bug-texinfo).  I have no idea whether it is easy to catch those
errors in order to say something more informative that mentions
@documentencoding.
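
The decode-on-read, encode-on-write model described above can be sketched as
follows.  This is a Python analogue for illustration only, not texi2any's Perl
code; the function names and the wording of the messages are hypothetical.  It
shows one way a low-level codec error could be caught and reported in terms of
@documentencoding.

```python
def read_texi(raw: bytes, input_encoding: str) -> str:
    """Decode input bytes to Unicode text (hypothetical helper)."""
    try:
        return raw.decode(input_encoding)
    except UnicodeDecodeError as err:
        # Re-raise with a message that points the user at @documentencoding.
        raise ValueError(
            f"input is not valid {input_encoding} ({err.reason}); "
            "check @documentencoding"
        ) from err


def write_output(text: str, output_encoding: str) -> bytes:
    """Encode Unicode text to the output encoding (hypothetical helper)."""
    try:
        return text.encode(output_encoding)
    except UnicodeEncodeError as err:
        # Report the first character that the output encoding cannot hold.
        bad = text[err.start]
        raise ValueError(
            f"cannot encode U+{ord(bad):04X} in {output_encoding}; "
            "check @documentencoding"
        ) from err
```

Whether the equivalent errors are as easy to intercept in perl is exactly the
open question raised above.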

Unless I am mistaken, the default input encoding used by perl when there
is no @documentencoding is based on the locale (though even if I set the
locale to "C", a utf8 encoded file is read correctly).  The default
output encoding is set to utf-8 for docbook, html and xml.  For
plaintext and info, I am not sure, but it is most likely the same as the
input encoding.  @documentencoding sets the output encoding to the
@documentencoding for Info and HTML.

In Plaintext/Info the @documentencoding matters regardless of the
encoding perl uses because, for instance, accent commands are
transliterated if there is no @documentencoding (or the encoding is
us-ascii), for example

 @'e 

becomes

 'e

while with @documentencoding utf-8, it becomes é.
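
A minimal sketch of that rule, again as a Python analogue rather than the
actual converter code.  The function name is hypothetical, and the fallback to
the transliteration when the composed character cannot be encoded is an
assumption, not something stated above.

```python
import unicodedata
from typing import Optional


def render_acute(letter: str, document_encoding: Optional[str]) -> str:
    """Render @'<letter> for Plaintext/Info output (hypothetical helper)."""
    # No @documentencoding, or us-ascii: transliterate, as with @'e -> 'e.
    if document_encoding is None or document_encoding.lower() == "us-ascii":
        return "'" + letter
    # Otherwise compose the letter with a combining acute accent (U+0301).
    composed = unicodedata.normalize("NFC", letter + "\u0301")
    try:
        composed.encode(document_encoding)
    except UnicodeEncodeError:
        # Assumption: fall back to the transliteration if not encodable.
        return "'" + letter
    return composed
```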


From an implementation point of view, implementing @U{NNNN} would be rather
trivial even for Plaintext and Info if we assume that the user only uses
unicode points that perl can output in the output encoding (default or
set by @documentencoding): it would indeed just amount to replacing it
by "\x{NNNN}".  In HTML and XML, to be consistent, one should take
ENABLE_ENCODING_USE_ENTITY into account to choose between "&#xNNNN;" and
"\x{NNNN}".  If we want to make sure that the output is correct and perl
won't output a warning, it may be less easy, and I don't even know if it
is possible.
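
To make the options concrete, here is a hedged sketch of the expansion in
Python (an analogue, not a patch).  The function name and its parameters are
hypothetical, the use_entity flag only stands in for
ENABLE_ENCODING_USE_ENTITY without claiming its exact semantics, and the
"U+NNNN" fallback is the one Karl suggested above.

```python
def expand_u(hex_digits: str, output_encoding: str,
             html: bool = False, use_entity: bool = False) -> str:
    """Expand @U{NNNN} for one output format (hypothetical helper)."""
    code_point = int(hex_digits, 16)
    if html and use_entity:
        # Numeric character reference, independent of the output encoding.
        return f"&#x{code_point:X};"
    char = chr(code_point)
    try:
        # Probe whether the output encoding can represent the character.
        char.encode(output_encoding)
    except UnicodeEncodeError:
        # Not representable: entity in HTML/XML, literal "U+NNNN" otherwise.
        return f"&#x{code_point:X};" if html else f"U+{code_point:04X}"
    return char
```

The probe before output is the "make sure the output is correct" step; values
above 10FFFF are not handled in this sketch.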


-- 
Pat
