Encoding issues : the DESCRIPTION file

octave-maintainers

From:	Julien Bect
Subject:	Encoding issues : the DESCRIPTION file
Date:	Tue, 20 Jan 2015 14:27:47 +0100
User-agent:	Mozilla/5.0 (X11; Linux i686; rv:31.0) Gecko/20100101 Thunderbird/31.3.0

Hello everyone,

I have started to investigate encoding issues in the generate_html package, following this discussion:


http://octave.1599824.n4.nabble.com/generate-html-breaks-documentation-encoding-tp4668154.html

Before fixing anything, I am currently trying to see what the state of affairs is...

I would like to discuss the specific case of the DESCRIPTION file first (but others, such as NEWS or COPYING, raise similar issues). Reminder: the content of DESCRIPTION is used to create "overview.html".

My question is: which encoding can be used, or should be used, in this file? Here are a few facts.


=== BEGIN FACTS ===

1) There is no mention of a specific encoding in the documentation

https://www.gnu.org/software/octave/doc/interpreter/Creating-Packages.html

and, to the best of my knowledge, it is not currently possible to indicate which encoding is actually used in the DESCRIPTION file of a given package (or more generally in all the files of a given package).

2) Currently, the "octave-forge" style in the generate_html package assumes "iso-8859-1".

3) As far as I can tell, most packages only use US-ASCII (7-bit ASCII), which is a proper subset of both ISO-8859-1 and UTF-8.

4) Some packages already use UTF-8 in their DESCRIPTION file. For instance, there is an "ø" (C3B8, LATIN SMALL LETTER O WITH STROKE) in the generate_html package and a "ë" (C3AB, LATIN SMALL LETTER E WITH DIAERESIS) in the image package.

5) For the packages where DESCRIPTION contains UTF-8 characters, I assume (sorry, not exactly a fact anymore) that the html produced by generate_package_html () has been manually edited to replace "charset=iso-8859-1" by "charset=utf-8". @Søren, Carnë: is that correct?


=== END FACTS ===
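
To make facts 2 and 4 concrete, here is a small Python sketch (not part of generate_html; the helper name `describe` is made up for illustration). It shows the bytes of the two characters in each encoding, and what a reader would see if UTF-8 bytes were served with "charset=iso-8859-1".

```python
def describe(ch):
    """Return (UTF-8 bytes as hex, ISO-8859-1 bytes as hex, misread text).

    'Misread text' is what results when the UTF-8 bytes are decoded as
    ISO-8859-1, which is the situation described in facts 2 and 4.
    """
    utf8 = ch.encode("utf-8")
    latin1 = ch.encode("iso-8859-1")
    misread = utf8.decode("iso-8859-1")
    return utf8.hex().upper(), latin1.hex().upper(), misread


for ch in ("ø", "ë"):
    print(ch, describe(ch))
```

Each character takes two bytes in UTF-8 and one byte in ISO-8859-1, so a page declared with the wrong charset shows two unrelated characters in place of one.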

I would like to come up with a solution that is clear and consistent for the *automatic* processing of DESCRIPTION files (no more manual editing should be needed).


Here are some options.

A) Assume US-ASCII. Error if any character > 0x7F is encountered.

A') Same as A, unless an optional ENCODING file is present, in which case DESCRIPTION (and COPYING, and NEWS) is assumed to have the encoding indicated in that file.

B) Assume ISO-8859-1. For "ø" and "ë" this wouldn't be a problem (F8 and EB) but sooner or later a package manager whose name cannot be written in ISO-8859-1 will join the project...


B') Assume ISO-8859-1 with an optional ENCODING file.

C) Assume UTF-8.

C') Assume UTF-8 with an optional ENCODING file (for package managers who *really* don't want to use UTF-8).

D) In A', B' or C', use a new optional field in DESCRIPTION instead of an ENCODING file.
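
For comparison, the options above can be sketched as a single decoding rule with two knobs. This is only a rough Python illustration under my own assumptions: the function name `decode_description` and its parameters are hypothetical, and nothing like this exists in pkg or generate_html today. `default` selects option A ("ascii"), B ("iso-8859-1") or C ("utf-8"); `declared` stands for the content of the optional ENCODING file (or field, for option D) in the primed variants.

```python
from typing import Optional


def decode_description(raw: bytes,
                       declared: Optional[str] = None,
                       default: str = "ascii") -> str:
    """Decode the raw bytes of a DESCRIPTION file (hypothetical helper).

    declared: encoding name read from an optional ENCODING file, if any.
    default:  encoding assumed when nothing is declared.
    """
    # A blank or missing declaration falls back to the default.
    encoding = (declared or "").strip() or default
    try:
        # Strict decoding: any byte that is invalid in the chosen
        # encoding is an error, as in option A for bytes > 0x7F.
        return raw.decode(encoding, errors="strict")
    except UnicodeDecodeError as err:
        bad = raw[err.start]
        raise ValueError(
            f"DESCRIPTION is not valid {encoding}: "
            f"byte 0x{bad:02X} at offset {err.start}"
        ) from err


# Option A rejects the UTF-8 "ø"; option A' accepts it once declared.
data = "Author: Søren".encode("utf-8")
print(decode_description(data, declared="utf-8"))
```

Note that with default="iso-8859-1" (option B) decoding never fails, since every byte value is assigned in Python's iso-8859-1 codec, so a UTF-8 file would be silently misread rather than rejected.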

I would vote for A' (just requires a small number of package managers to add an ENCODING file) or C (doesn't seem to require any additional work at all).


Any thoughts?
