


From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] difference in iconv for EBCDIC SBCS conversion from z/OS OS-provided iconv
Date: Sun, 02 Apr 2023 23:40:42 +0200

Hi Mike,

> >   * Why was the bug report that wanted glibc's IBM1047 mapping table changed
> >     closed as "NOT A BUG"?
> >     https://bugzilla.redhat.com/show_bug.cgi?id=170072
> 
> Based on Eric and Jakub's discussion, I would agree with Jakub that
> unfortunately we seem to have 2 'standards' here which are incompatible
> and it would be good for the user community if we supported both.

This bug report references http://www.unicode.org/versions/Unicode4.0.0/ch05.pdf
from 2003. In the newest edition, at
https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf ,
there are more details:

  * The "newline function" is represented by
    - U+000A (= LF) on Unix,
    - U+0085 (= NEL) on EBCDIC-based OS,
    - U+000D (= CR) on MacOS 9 and earlier.
    [Table 5-2]

  * In EBCDIC-based OSes other than z/OS (you listed them in
    <https://lists.gnu.org/archive/html/bug-gnu-libiconv/2023-04/msg00002.html>)
    this newline function is represented by EBCDIC 0x15.
    [Table 5-1]

  * Likewise "text files on z/OS traditionally use NEL for the newline
    function."
    [page 210]

  * But in the z/OS Unix System Services U+000A maps to EBCDIC 0x15. [Table 5-1]
    "That mapping arises from the use of the LF character for the newline
     function in C programs and in Unix environments"
    [page 210]
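
Since the two conventions differ only in what the EBCDIC bytes 0x15 and 0x25
stand for, output that follows one of them can be turned into the other by
exchanging those two bytes. A minimal sketch, assuming a POSIX shell with
`tr` and `od`, and assuming that the z/OS Unix convention is a plain swap of
the two bytes (the text cited above only states the LF side of it). The
function name `swap_ebcdic_newline` is made up for this illustration:

```shell
# Hypothetical helper: exchange EBCDIC bytes 0x15 (octal 025) and 0x25 (octal 045).
# Assumption: the two conventions differ only in these two byte values.
swap_ebcdic_newline() {
  LC_ALL=C tr '\025\045' '\045\025'
}

# "hello" in IBM1047 followed by 0x25, i.e. LF under the formal mapping.
# After the swap the last byte is 0x15, the value of '\n' in C code on z/OS.
printf '\210\205\223\223\226\045' | swap_ebcdic_newline | od -An -tx1
```

Applying the helper twice gives back the original bytes, so the same filter
works in both directions.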

So, for z/OS users, it appears to depend on whether they are working more with
"traditional" programs or more with "C programs and Unix environments".

> >   * Why did msbrown write "Note that "line feed" is 0x25 in EBCDIC/IBM-1047,
> >     but the C language '\n' is 0x15 (EBCDIC "new line")." ?
> >     https://www.austingroupbugs.net/view.php?id=251
> 
> This is the crux of the situation. A huge number of tools are either
> written in C/C++ or the tools are built with other tools written in C/C++
> and the '\n' in all the code is 0x15.
> So choosing a different value for a file means that none of those tools
> work. In particular, if you iconv a file and try to use 'less' it won't
> work because it won't 'see' the newlines.
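
To illustrate the point about 'less': a tool that treats 0x15 as its line
terminator finds no line ends in a file whose line ends are 0x25. A rough
sketch, assuming POSIX `tr` and `wc`; `count_0x15` is a made-up helper name,
and the byte strings are "hello" in IBM1047 as in the hex dumps of this thread:

```shell
# Hypothetical helper: count the bytes that a C program on z/OS would
# treat as '\n' (0x15, octal 025) in the data read from standard input.
count_0x15() {
  LC_ALL=C tr -cd '\025' | wc -c | tr -d ' '
}

# "hello" + 0x25 (LF under the formal mapping): contains no 0x15 byte.
printf '\210\205\223\223\226\045' | count_0x15
# "hello" + 0x15 (the value of '\n' in C code on z/OS): contains one.
printf '\210\205\223\223\226\025' | count_0x15
```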

I see. It sounds like there was a standard at some point; then the
tools written in C made up a different de-facto standard, and now the
original standard is less relevant (but no new formal standard was issued).

> I would think on z/OS for our UNIX System Services customers
> we could 'compile in' this value (like PCRE) which would be my preference
> over an environment variable.

I prefer an environment variable, for these reasons:

  1) Given the text that I've cited above, it looks like some "traditional"
     files on z/OS need U+0085 to map to EBCDIC 0x15. Therefore, for the same
     user in the same OS, sometimes one way is desired, sometimes the other
     way.

  2) Compatibility with glibc iconv and recode 3.7.x; both map U+0085 to
     EBCDIC 0x15:

     $ echo hello | iconv -f ASCII -t IBM1047 | hd
     000000  88 85 93 93 96 25                                .....%
     $ printf 'hello\u0085' | iconv -f UTF-8 -t IBM1047 | hd
     000000  88 85 93 93 96 15                                ......

     $ echo hello | recode ASCII..IBM1047 | hd
     000000  88 85 93 93 96 25                                .....%
     $ printf 'hello\u0085' | recode UTF-8..IBM1047 | hd
     000000  88 85 93 93 96 15                                ......

  3) It is more sensible to _deviate_ from a formal standard by setting
     an environment variable, than it is to _adhere_ to a formal standard
     by setting an environment variable. (Remember the days of the
     POSIX_ME_HARDER environment variable? :-) )
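
How such a switch might look from the user's side can be sketched as a
filter placed after the converter. This is only an illustration, not how
libiconv implements or names anything: the variable `EBCDIC_NEWLINE_STYLE`,
its value `zos-unix` and the function `apply_newline_style` are hypothetical.
The default follows the formal mapping, and the variable opts into the
deviation, in line with reason 3):

```shell
# EBCDIC_NEWLINE_STYLE is a hypothetical variable name used only for this sketch.
#   unset or any other value: pass the bytes through unchanged (formal mapping)
#   "zos-unix": exchange 0x15 and 0x25, so that LF ends up as 0x15
apply_newline_style() {
  if [ "${EBCDIC_NEWLINE_STYLE:-}" = "zos-unix" ]; then
    LC_ALL=C tr '\025\045' '\045\025'
  else
    cat
  fi
}
```

It would be used as the last stage of the pipelines shown above, for example
`echo hello | iconv -f ASCII -t IBM1047 | apply_newline_style | hd`, with
`EBCDIC_NEWLINE_STYLE=zos-unix` exported only in the sessions that need the
z/OS Unix behaviour.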

Bruno





