Re: [bug-gawk] gawk 4.x series mmap attempts to alocates 32GB of memory


From: Nelson H. F. Beebe
Subject: Re: [bug-gawk] gawk 4.x series mmap attempts to alocates 32GB of memory ...
Date: Thu, 10 Jul 2014 19:35:56 -0600 (MDT)

Green Fox <address@hidden> writes today:

>> when one is reading from a ( disk / server ) that does not match
>> the local character set, the current gawk setup fails really badly.
>> When handling filenames that is not a valid utf-8, ....

There is an extensive discussion going on now about that issue on the
TeX Live list, which is archived at

        http://tug.org/mailman/listinfo/tex-live

The message traffic exhibits significant problems in the support of
non-ASCII characters in filenames.  

The problem is much more complex than some people think, and part of
the difficulty arises because:

        (a) strings (such as filenames) are virtually never tagged
            with their character sets;

        (b) filesystems can be shared between disparate operating
            systems with different character set conventions; and

        (c) filesystem syntax generally views filenames as byte
            sequences, rather than character strings.

Thus, because UTF-nn encodings of Unicode do not permit all possible
byte combinations, it is quite easy to have filenames that must be
handled by software, and yet whose names are not describable as valid
Unicode character strings.
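
As a minimal sketch of that point (in Python, chosen here only for
illustration; the filename and helper names below are hypothetical),
the same name encoded in Latin-1 is a perfectly good byte sequence for
a filesystem, yet it is not valid UTF-8.  The sketch also shows one
way that software can carry such bytes through a string type without
loss, using Python's "surrogateescape" error handler:

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the byte string decodes as strict UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def round_trip(name: bytes) -> bytes:
    """Decode to str and back, preserving undecodable bytes.

    The "surrogateescape" handler maps each invalid byte to a lone
    surrogate code point, so the original bytes can be recovered.
    """
    text = name.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogateescape")


# Hypothetical filename, as it might be stored by two different systems.
utf8_name = "résumé.txt".encode("utf-8")
latin1_name = "résumé.txt".encode("latin-1")

print(is_valid_utf8(utf8_name))
print(is_valid_utf8(latin1_name))
print(round_trip(latin1_name) == latin1_name)
```

Note that the escaped string is not a valid Unicode string either; it
is only a container that lets the bytes survive the trip.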

This is all a HUGE can of worms, and I suspect that we should avoid
opening it.  It is worth recalling that Brian Kernighan at one point
added some limited support in nawk for multibyte coding and
internationalization, then withdrew it on finding the portability
problems that it exposed.  His FIXES file entry of 28 July 2003 says:

        a moratorium is hereby declared on internationalization changes.
        i apologize to friends and colleagues in other parts of the world.
        i would truly like to get this "right", but i don't know what
        that is, and i do not want to keep making changes until it's clear.

The awk, mawk, nawk, and oawk implementations treat files as character
streams, where NUL (0x00) is a string terminator.  By contrast, gawk's
view of files is that they are simply byte streams, and no byte value
has any more significance than any other byte value: 0x00 is just a
normal data byte.  Thus, with care, gawk can be used to read and write
arbitrary files.  From that point of view, the less it knows about
`characters', the better.
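
The difference between the two views can be modeled in a few lines.
This is only a sketch in Python under stated assumptions, not the code
of any awk implementation: the function names are hypothetical, and
the first function merely imitates what happens when a record is held
in a NUL-terminated C string.

```python
def c_string_view(record: bytes) -> bytes:
    """Model a NUL-terminated string: data from the first 0x00 on is lost."""
    return record.split(b"\x00", 1)[0]


def byte_stream_view(record: bytes) -> bytes:
    """Model a counted byte string: every byte, including 0x00, is data."""
    return record


# A hypothetical record containing an embedded NUL byte.
record = b"abc\x00def"

print(len(c_string_view(record)))
print(len(byte_stream_view(record)))
```

Under the first model the record is silently truncated to 3 bytes;
under the second all 7 bytes survive, which is what makes processing
of arbitrary binary files possible.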


-------------------------------------------------------------------------------
- Nelson H. F. Beebe                    Tel: +1 801 581 5254                  -
- University of Utah                    FAX: +1 801 581 4148                  -
- Department of Mathematics, 110 LCB    Internet e-mail: address@hidden  -
- 155 S 1400 E RM 233                       address@hidden  address@hidden -
- Salt Lake City, UT 84112-0090, USA    URL: http://www.math.utah.edu/~beebe/ -
-------------------------------------------------------------------------------


