Encoding issues : the DESCRIPTION file
From: Julien Bect
Subject: Encoding issues : the DESCRIPTION file
Date: Tue, 20 Jan 2015 14:27:47 +0100
User-agent: Mozilla/5.0 (X11; Linux i686; rv:31.0) Gecko/20100101 Thunderbird/31.3.0

Hello everyone,
I have started to investigate encoding issues in the generate_html
package, following this discussion :
http://octave.1599824.n4.nabble.com/generate-html-breaks-documentation-encoding-tp4668154.html
Before fixing anything, I am currently trying to see what the state of
affairs is...
I would like to discuss the specific case of the DESCRIPTION file first
(but others, such as NEWS or COPYING, raise similar issues). Reminder :
the content of DESCRIPTION is used to create "overview.html".
My question is: which encoding can be used, or should be used, in this
file ? Here are a few facts.
=== BEGIN FACTS ===
1) There is no mention of a specific encoding in the documentation
https://www.gnu.org/software/octave/doc/interpreter/Creating-Packages.html
and, to the best of my knowledge, it is not currently possible to
indicate which encoding is actually used in the DESCRIPTION file of a
given package (or more generally in all the files of a given package).
2) The "octave-forge" style in the generate_html package currently
assumes "iso-8859-1".
3) As far as I can tell, most packages only use US-ASCII (7-bit ASCII),
which is a proper subset of both ISO-8859-1 and UTF-8.
4) Some packages already use UTF-8 in their DESCRIPTION file. For
instance, there is an "ø" (UTF-8 bytes C3 B8, LATIN SMALL LETTER O WITH
STROKE) in the generate_html package and a "ë" (UTF-8 bytes C3 AB, LATIN
SMALL LETTER E WITH DIAERESIS) in the image package.
5) For the packages where DESCRIPTION contains UTF-8 characters, I
assume (sorry, not exactly a fact anymore) that the HTML produced by
generate_package_html () has been manually edited to replace
"charset=iso-8859-1" with "charset=utf-8". @Søren, Carnë: is that correct ?
=== END FACTS ===
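
To make facts 2) and 4) concrete, here is a small sketch in Python (used
only for illustration; it is not part of generate_html, and the helper
name byte_views is made up). It shows the bytes of each character in both
encodings, and what a reader sees when UTF-8 bytes are served with
"charset=iso-8859-1":

```python
def byte_views(ch):
    """Return (UTF-8 bytes as hex, ISO-8859-1 bytes as hex, misread text).

    The misread text is what appears when the UTF-8 bytes of `ch` are
    interpreted as ISO-8859-1.
    """
    utf8 = ch.encode("utf-8")
    latin1 = ch.encode("iso-8859-1")
    misread = utf8.decode("iso-8859-1")
    return utf8.hex().upper(), latin1.hex().upper(), misread


for ch in ("ø", "ë"):
    print(ch, *byte_views(ch))
# ø C3B8 F8 Ã¸
# ë C3AB EB Ã«
```

The two-character garbage in the last column is presumably what shows up
in "overview.html" when the charset declaration is not edited by hand.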
I would like to come up with a solution that is clear and consistent for
the *automatic* processing of DESCRIPTION files (no more manual editing
should be needed).
Here are some options.
A) Assume US-ASCII. Error if any character > 0x7F is encountered.
A') Same as A, unless an optional ENCODING file is present, in which case
DESCRIPTION (and COPYING, and NEWS) is assumed to have the encoding
indicated in that file.
B) Assume ISO-8859-1. For "ø" and "ë" this wouldn't be a problem (F8 and
EB) but sooner or later a package manager whose name cannot be written
in ISO-8859-1 will join the project...
B') Assume ISO-8859-1 with an optional ENCODING file.
C) Assume UTF-8.
C') Assume UTF-8 with an optional ENCODING file (for package managers
who *really* don't want to use UTF-8).
D) In A', B' or C', use a new optional field in DESCRIPTION instead of
an ENCODING file.
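
The options above differ only in the default encoding and in whether a
declared encoding may override it. A minimal sketch of that logic, again
in Python for illustration only: the function name decode_description and
the "declared" parameter (standing for the content of the proposed ENCODING
file or DESCRIPTION field) are hypothetical, nothing like this exists in
generate_html today.

```python
def decode_description(raw, default="ascii", declared=None):
    """Decode the raw bytes of a DESCRIPTION file.

    default  -- encoding assumed when nothing is declared:
                "ascii" (option A), "iso-8859-1" (B) or "utf-8" (C)
    declared -- encoding name taken from the hypothetical ENCODING file
                or DESCRIPTION field (options A', B', C', D), or None
    """
    encoding = declared.strip() if declared else default
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as err:
        # Report the first offending byte instead of producing garbage.
        raise ValueError(
            f"DESCRIPTION is not valid {encoding}: "
            f"byte 0x{raw[err.start]:02X} at offset {err.start}"
        ) from err


raw = "Søren".encode("utf-8")
print(decode_description(raw, default="utf-8"))       # option C: Søren
print(decode_description(raw, declared="utf-8\n"))    # option A': Søren
print(decode_description(raw, default="iso-8859-1"))  # option B: SÃ¸ren
```

Note that with option B the last call does not fail: every byte sequence
is valid ISO-8859-1, so UTF-8 input is silently turned into garbage,
whereas options A and C reject input that does not match the assumed
encoding.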
I would vote for A' (it only requires a small number of package managers
to add an ENCODING file) or C (which doesn't seem to require any
additional work at all).
Any thoughts ?