Re: [bug-gawk] gawk 4.x series mmap attempts to alocates 32GB of memory
From: Nelson H. F. Beebe
Subject: Re: [bug-gawk] gawk 4.x series mmap attempts to alocates 32GB of memory ...
Date: Thu, 10 Jul 2014 19:35:56 -0600 (MDT)
Green Fox <address@hidden> writes today:
>> when one is reading from a ( disk / server ) that does not match
>> the local character set, the current gawk setup fails really badly.
>> When handling filenames that is not a valid utf-8, ....
There is an extensive discussion going on now about that issue on the
TeX Live list, which is archived at
http://tug.org/mailman/listinfo/tex-live
The message traffic exhibits significant problems in the support of
non-ASCII characters in filenames.
The problem is much more complex than some people think, and part of
the difficulty arises because:
(a) strings (such as filenames) are virtually never tagged
with their character sets;
(b) filesystems can be shared between disparate operating
systems with different character set conventions; and
(c) filesystem syntax generally views filenames as byte
sequences, rather than character strings.
Thus, because UTF-nn encodings of Unicode do not permit all possible
byte combinations, it is quite easy to have filenames that must be
handled by software, and yet whose names are not describable as valid
Unicode character strings.
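
To make the point above concrete, here is a minimal sketch in Python (not awk; the filename is a hypothetical example). It shows that the same name stored in Latin-1 is a perfectly legal byte sequence for a filesystem, yet is not valid UTF-8. It also shows one workaround some software uses, Python's "surrogateescape" error handler, which carries the offending bytes through a string and back without loss:

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the byte string decodes as strict UTF-8."""
    try:
        name.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical filename, as it would be stored by systems using
# two different character set conventions.
latin1_name = "café.txt".encode("latin-1")  # b'caf\xe9.txt'
utf8_name = "café.txt".encode("utf-8")      # b'caf\xc3\xa9.txt'

print(is_valid_utf8(utf8_name))    # valid UTF-8
print(is_valid_utf8(latin1_name))  # 0xE9 is not followed by continuation bytes

# Lossless round trip of the undecodable byte via surrogateescape.
as_text = latin1_name.decode("utf-8", errors="surrogateescape")
round_trip = as_text.encode("utf-8", errors="surrogateescape")
print(round_trip == latin1_name)
```

Nothing in the byte string itself says which encoding produced it, which is point (a) above: the software can only guess.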
This is all a HUGE can of worms, and I suspect that we should avoid
opening it. It is worth recalling that Brian Kernighan at one point
added some limited support in nawk for multibyte coding and
internationalization, then withdrew it on finding the portability
problems that it exposed. His FIXES file entry of 28 July 2003 says:
a moratorium is hereby declared on internationalization changes.
i apologize to friends and colleagues in other parts of the world.
i would truly like to get this "right", but i don't know what
that is, and i do not want to keep making changes until it's clear.
The awk, mawk, nawk, and oawk implementations treat files as character
streams, where NUL (0x00) is a string terminator. By contrast, gawk's
view of files is that they are simply byte streams, and no byte value
has any more significance than any other byte value: 0x00 is just a
normal data byte. Thus, with care, gawk can be used to read and write
arbitrary files. From that point of view, the less it knows about
`characters', the better.
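
The difference between the two views can be illustrated with a short Python sketch. This is only a model of the behavior described above, not code from any awk implementation, and the function names are made up for the illustration:

```python
def c_string_view(record: bytes) -> bytes:
    """Model a NUL-terminated string: data after the first 0x00 is lost."""
    return record.split(b"\x00", 1)[0]


def byte_stream_view(record: bytes) -> bytes:
    """Model a byte-stream view: every byte value is ordinary data."""
    return record


record = b"abc\x00def\n"  # hypothetical input record containing a NUL byte

print(len(c_string_view(record)))     # only the bytes before the NUL survive
print(len(byte_stream_view(record)))  # all bytes survive, including 0x00
```

Under the first view the record is silently truncated at the NUL; under the second it passes through unchanged, which is what makes it possible to process arbitrary binary files.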
-------------------------------------------------------------------------------
- Nelson H. F. Beebe Tel: +1 801 581 5254 -
- University of Utah FAX: +1 801 581 4148 -
- Department of Mathematics, 110 LCB Internet e-mail: address@hidden -
- 155 S 1400 E RM 233 address@hidden address@hidden -
- Salt Lake City, UT 84112-0090, USA URL: http://www.math.utah.edu/~beebe/ -
-------------------------------------------------------------------------------