
Re: [bug-gawk] gawk 4.x series mmap attempts to allocate 32GB of memory


From: green fox
Subject: Re: [bug-gawk] gawk 4.x series mmap attempts to allocate 32GB of memory and fails when using printf("%c") supplied with a large floating point value.
Date: Fri, 11 Jul 2014 09:47:52 +0900

Mr. Arnold, thank you for the patch.
Patch applied; working confirmed.

bash-4.2# ./gawk 'BEGIN{printf("%c",sprintf("%c",(0xffffff00+255)));}'|xxd
0000000: ff                                       .
bash-4.2# ./gawk 'BEGIN{printf "%c", sprintf("%c", 4.29497e+09 + 256)}'|xxd
0000000: e0ae 90                                  ...
bash-4.2# ./gawk -b 'BEGIN{printf "%c", sprintf("%c", 4.29497e+09 + 256)}'|xxd
0000000: 90                                       .
bash-4.2# ./gawk 'BEGIN{printf "%c", sprintf("%c", 4.29496e+09 + 256)}'|xxd
0000000: 80                                       .
bash-4.2# ./gawk 'BEGIN{printf("%c",sprintf("%c",(0xffffff00+255)));}'|xxd
0000000: ff                                       .

>The patch is below. I will shortly push this into the git repo.
Thanks!
>> While sorting out this printf("%c") bug, can we ask for capability to
>> write out binary data?
>>
>> Especially for printf("%c",128) and above, it fails and dumps
>> 0xc280, which is valid UTF-8 but not what was asked for.

>Please use the -b option to get what you want.
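
(For the record, the difference -b makes should look roughly like this in a
UTF-8 locale; the byte values follow from the 0xc280 behaviour quoted above,
so take it as an illustration rather than a transcript:)

./gawk    'BEGIN{ printf "%c", 128 }' | xxd   # default: c2 80, the UTF-8 encoding of U+0080
./gawk -b 'BEGIN{ printf "%c", 128 }' | xxd   # with -b: the single raw byte 80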

After reading the source, I have come to understand how silly I was in
asking for a parameter that can be modified from within the script...
The cached gawk_mb_cur_max, the handling of regexp patterns for multi-byte
chars, and the handling in wstr2str() mean it is kind of like the BKL for awk.

Just a thought: _if_ I were to write code, which patch would you prefer
to accept?

A) Routines to address the handling of UTF-8 strings when -b is in effect.

B) Provide length(), substr(), index(), and print() with the extended
   capability to handle raw single-byte data, even on a UTF-8 system.
   (A rough sketch of what I mean follows below.)
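
For (B), what I have in mind at the script level is something like the
following. To be clear, none of these byte-wise names exist today; they are
made up purely to show the intent:

BEGIN {
    s = "h\303\251llo"     # "héllo": 5 characters, 6 bytes in UTF-8
    n = length(s)          # 5 today (character count)
    nb = blength(s)        # hypothetical: 6 (raw byte count)
    b = bsubstr(s, 2, 2)   # hypothetical: the two raw bytes 0xc3 0xa9
}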

The reason for asking is that when one is reading from a disk or server
that does not match the local character set, the current gawk setup
fails really badly.

When handling filenames that are not valid UTF-8 (from a file listing that
is mostly UTF-8) yet are perfectly valid byte sequences, gawk fails to
recognize the correct length(), for the obvious reason that it expects UTF-8.
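
A minimal illustration of the mismatch, assuming bash printf and a UTF-8
locale (\303\251 is the two-byte UTF-8 encoding of 'é'):

printf 'h\303\251llo' | ./gawk    '{ print length($0) }'   # prints 5: characters
printf 'h\303\251llo' | ./gawk -b '{ print length($0) }'   # prints 6: raw bytes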

Working with xxd is not a nice solution. About 17 hours ago, I had:

LANG=C find -type f -print0 |xxd -ps -c 1 |\
awk '/^00$/{print t"00";t="";next;}{t=t""$0" ";}'|\
grep ' 2e 6a 70 67 00$' |\
awk '{w=substr($0,length("2e 2f <LONG_HEX_STRING_HERE> 39 41 2d")+1);
z=$0;
sub(" 00$","",z);sub(" 00$","",w);
print "6d 76 20 27 "z" 27 20 27 "w" 27 0a";}' |xxd -ps -r   |bash

to find all files with a '.jpg' extension, rip out the part of the path that
is a valid byte sequence yet invalid UTF-8, and mv them to valid file names.
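
If gawk -b handled this reliably end to end, I imagine the same job could be
done without the hex round-trip, along these lines. This is an untested
sketch: it assumes gawk accepts NUL in RS/ORS for -print0 style records,
assumes GNU xargs, and GARBLED_PREFIX_9A- is only a stand-in for the same
long prefix as the hex string above:

LANG=C find . -type f -print0 |
./gawk -b 'BEGIN { RS = ORS = "\0"; pfx = "./GARBLED_PREFIX_9A-" }  # pfx: stand-in for the broken prefix
/\.jpg$/ { print $0; print substr($0, length(pfx) + 1) }            # emit old and new name, NUL-separated
' | xargs -0 -n 2 mv --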

Being able to get length() in raw byte size (when needed), while still
keeping the normal UTF-8 handling, would be really nice... and it would not
run into the regexp problem that changing gawk_mb_cur_max causes.
Hence the original request for handling byte values 128-255 while staying
in UTF-8.

Thanks for the fix anyway.

Green Fox


