bug-gawk

Re: [bug-gawk] gawk 4.x series mmap attempts to alocates 32GB of memory


From: green fox
Subject: Re: [bug-gawk] gawk 4.x series mmap attempts to alocates 32GB of memory ...
Date: Sat, 12 Jul 2014 16:26:31 +0900

On 7/11/14, Nelson H. F. Beebe <address@hidden> wrote:
> There is an extensive discussion going on now about that issue on the
> TeX Live list, which is archived at
>
>       http://tug.org/mailman/listinfo/tex-live
I will check it out.

> The problem is much more complex than some people think, and part of
It is much worse than that, I'm afraid.
We have been dealing with the problem for the last 20 years over here...

> Thus, because UTF-nn encodings of Unicode do not permit all possible
> byte combinations, it is quite easy to have filenames that must be
> handled by software, and yet whose names are not describable as valid
> Unicode character strings.
Yes, and we work around that by treating every byte except 0x2f ('/') and
0x00 (NUL) as part of an opaque blob.
We had to do that, and it worked nicely.
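
To make the blob approach concrete, here is a minimal sketch in Python (not gawk, and not code from this thread). It assumes POSIX-style names, where only 0x2f and 0x00 are special; the helper names is_valid_component() and split_path() are hypothetical and chosen only for illustration.

```python
from typing import List

SLASH = 0x2F  # '/', the path separator
NUL = 0x00    # the string terminator


def is_valid_component(name: bytes) -> bool:
    """True if name is non-empty and contains neither reserved byte."""
    return len(name) > 0 and SLASH not in name and NUL not in name


def split_path(path: bytes) -> List[bytes]:
    """Split on 0x2f only; empty pieces from leading or repeated slashes are dropped."""
    return [part for part in path.split(b"/") if part]


if __name__ == "__main__":
    # The middle component is not valid UTF-8, yet it is handled without decoding.
    raw = b"/tmp/\xff\xfe/caf\xe9.txt"
    for part in split_path(raw):
        print(part, is_valid_component(part))
```

Nothing here ever decodes the name, so byte sequences that are not valid in any UTF encoding pass through untouched.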
Until some random American came up with the idea that the world's languages
can fit into 16 bits. haha... then 32 bits... hahaha... then
trimming bytes 5 and 6 off UTF-8... hahahahahah...
and CJK... and Arabic bidi... hahahahahahahahah. They never thought hard
enough about those of us who actually have to use the crippled system.
Sorry, I don't mean to be a troll. It's just that our end of the stick is
a bit too short.

> This is all a HUGE can of worms, and I suspect that we should avoid
> opening it.  It is worth recalling that Brian Kernighan at one point
> added some limited support in nawk for multibyte coding and
> internationalization, then withdrew it on finding the portability
> problems that it exposed.  His FIXES file entry of 28 July 2003 says:
>
>       a moratorium is hereby declared on internationalization changes.
>       i apologize to friends and colleagues in other parts of the world.
>       i would truly like to get this "right", but i don't know what
>       that is, and i do not want to keep making changes until it's clear.
>
> The awk, mawk, nawk, and oawk implementations treat files as character
> streams, where NUL (0x00) is a string terminator.  By contrast, gawk's
> view of files is that they are simply byte streams, and no byte value
> has any more significance than any other byte value: 0x00 is just a
> normal data byte.  Thus, with care, gawk can be used to read and write
> arbitrary files.  From that point of view, the less it knows about
> `characters', the better.
Agreed.
The sad part is, we no longer have the capability in gawk to return to
a 'byte stream'.
Once you're in 'character stream' mode, there is no way out of it... there
is no such function as
 sir_gawk_I_want_to_be_fed_byte_stream_data_please()

I am asking for a _do_what_I_ask_ option, so that I can write bytes in the
0x80-0xff range.
It would be nicer still if I had always_byte_based_length() and
always_byte_based_substr();
then the rest of the routines could be written on top of them...
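
As an illustration of what the two requested primitives might look like, here is a minimal sketch in Python rather than gawk. The names byte_length() and byte_substr() are hypothetical stand-ins for the functions wished for above, and the 1-based indexing with clamping at both ends is an assumption modelled on awk's substr().

```python
from typing import Optional


def byte_length(data: bytes) -> int:
    """Length in bytes, regardless of how the bytes would decode."""
    return len(data)


def byte_substr(data: bytes, start: int, length: Optional[int] = None) -> bytes:
    """Return the bytes at 1-based positions start .. start+length-1.

    Positions outside 1..len(data) are ignored; with no length given,
    everything from start to the end is returned.
    """
    end = len(data) + 1 if length is None else start + length  # exclusive, 1-based
    lo = max(start, 1)
    hi = min(end, len(data) + 1)
    if hi <= lo:
        return b""
    return data[lo - 1:hi - 1]


if __name__ == "__main__":
    s = "日本語".encode("utf-8")   # 3 characters, 9 bytes
    print(len("日本語"), byte_length(s))
    print(byte_substr(s, 1, 3))     # the three bytes that encode the first character
    print(byte_substr(s, 2, 2))     # a slice that cuts through a character
```

The last call returns a fragment that is not valid UTF-8 on its own, which is exactly the behaviour a byte-oriented routine has to allow.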


