
Re: [bug-gawk] gawk 4.x series mmap attempts to allocate 32GB of memory


From: Eli Zaretskii
Subject: Re: [bug-gawk] gawk 4.x series mmap attempts to allocate 32GB of memory ...
Date: Sun, 13 Jul 2014 08:00:17 +0300

> Date: Sun, 13 Jul 2014 10:08:58 +0900
> From: green fox <address@hidden>
> Cc: address@hidden
> 
> To summarize, it is
> -IPA updates, separation of render and formatting before passing it off is
>  (a) possible for ASCII (b) not that nice for multi byte characters,

I very much question the validity of this.  I think this is the
result of years of hacking around existing software, instead of
adopting the correct solutions, which allow clear separation between
text processing and text presentation.

> Japanese and Chinese are what I use daily.
> JIS X 0213 and BIG5 extended.
> And as you know, these two have lots of extensions.
> I do know that in theory, round trip conversion is possible with the
> Unicode code points. In reality, I need to check for certain characters,
>  handle exceptions, and be really careful.
> And these are supposed to be all included into ISO/IEC 10646
>  (in a perfect world).
> We agreed to use IVS; however, the process is not complete yet. It is
> ongoing.
> Please check the Unicode IVD section, and Update listing from IPA,
> ITSCJ committee for detail on what is going on. I know you know, but still.
> To summarize, some letters are still not included.

AFAIU, this is all about presentation, not about processing the text.

> _IF_ the code page (or should I say, code points and the glyph)
> that is used daily in my country by lots of people is not registered
> as _international_ well known standard, do you disagree, even to help
> convert it to Unicode ?

Sorry, I don't understand the question: disagree to what?

In any case, if the codepage is supported by iconv, or you have a
patched iconv that includes its support, then you don't care whether
that codepage is an international standard or not; you can just use
it.
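
To make the iconv point concrete, here is a minimal sketch in C of
converting a buffer from some codepage to UTF-8.  The helper name
to_utf8() is made up for illustration (it is not part of iconv or
Gawk); which encoding names are accepted depends entirely on the iconv
installed on your system, patched or not.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Convert `inlen` bytes of `in` from the encoding `from` to UTF-8.
 * Writes at most `outcap - 1` bytes to `out`, NUL-terminates it, and
 * returns the number of bytes written, or (size_t)-1 on any error.
 * to_utf8() is a hypothetical helper, not an iconv or Gawk API.
 */
size_t to_utf8(const char *from, const char *in, size_t inlen,
               char *out, size_t outcap)
{
    assert(from != NULL && in != NULL && out != NULL && outcap > 0);

    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;          /* encoding unknown to this iconv */

    char *inp = (char *)in;         /* iconv() takes a non-const pointer */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap - 1;    /* reserve room for the NUL */

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outp, &outleft);  /* flush shift state */
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;          /* invalid input or output too small */

    *outp = '\0';
    return (size_t)(outp - out);
}
```

The caller only passes the encoding name; nothing in the code cares
whether that name is a registered international standard.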

> > These are all very important in many languages, I agree.  But they are
> > not, and should not be, Gawk problems.  Gawk is a text processing
> > tool, so it processes text in the logical order, disregarding any
> > display-time features.  If you are looking for solutions for
> > display-time problems, Gawk is not the right place to look.
> 
> Humm... sorry, I can not understand the humor that you put there.

There's no humor, I'm dead serious.

> Let's say I was crafting an HTML page from some text.
> I use regexp pattern to strip out what I do not need.
> Then, I print necessary tags, the text, then more tags, and it is all good.
> This only works in single byte characters, and limited set of multi-byte.

No, this should work for any text, as long as the regular expression
engine supports the Unicode regexp syntax per UTS#18.

> The reason is, for some characters, if I do not specify correct 'hint'
> the text becomes useless. In case of html, I must pass the type of
> language used, as separate tags. Without such a hint, the same
> character gets printed as a different language.

I don't understand how processing of text can produce a different
language.  Can you show a simple example, with processing by Gawk,
that does something like that?

> Therefore, being able to 'hint' correctly is very crucial.
> Think of it as leaving out the umlaut in German. Not nice.

Gawk does not strip umlauts, or any other diacriticals.  It might
replace some characters by others, but only according to the
replacement rules in your Gawk script.  If that script doesn't say to
replace ü by u, such a replacement will not happen.

Likewise with any other language.

> I think building a html / tex / ps from some text, using gawk is a
> valid use case. Maybe not for you, Eli, so we may differ here.

It _is_ a valid use case.  But I don't see how these use cases could
convert characters from one language to another, unless the Gawk
script explicitly called for that.

> To process text in the logical order, as you say, is not automated
> for us. UCA and ISO 14651 are two very different beasts.

I didn't mean string sorting order, I meant the order of characters in
the text, in the context of bidi.  See UAX#9 for the definition of
"logical order".

Sorting order is something the underlying C library needs to
implement.  Gawk just uses whatever is available in the C library for
a given locale.
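
As an illustration of what "available in the C library" means here, the
following sketch sorts strings with strcoll(), which compares according
to the LC_COLLATE rules of the selected locale.  It does not claim to
mirror Gawk's internals; sort_in_locale() is a made-up helper name.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort() callback: compare two strings by the current LC_COLLATE rules. */
static int collate_cmp(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcoll(*sa, *sb);
}

/*
 * Sort `n` strings in place using the collation order of `locale_name`.
 * Returns 0 on success, -1 if setlocale() rejects that locale name.
 * sort_in_locale() is a hypothetical helper for illustration only.
 */
int sort_in_locale(const char *locale_name, const char **items, size_t n)
{
    assert(locale_name != NULL && (items != NULL || n == 0));

    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return -1;
    if (n > 0)
        qsort(items, n, sizeof items[0], collate_cmp);
    return 0;
}
```

Whatever order comes out is the C library's decision for that locale;
the calling program supplies no collation rules of its own.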

> For _my_ daily use cases, I have files with names of people,
> and scans of old documents. They include letters that are not available
> on Unicode (at the moment) so we use some extensions so it can be
> resolved later.

Unicode provides the Private Use Areas (PUA) for this purpose.  You can use
them, in Gawk and elsewhere, and still be Unicode and UTF-8 compatible.
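
A small sketch in C of why PUA code points stay UTF-8 compatible: they
are encoded by exactly the same rules as any other code point.  The
function names is_private_use() and utf8_encode() are made up for this
example; the three ranges are the ones the Unicode Standard reserves
for private use.

```c
#include <assert.h>
#include <stddef.h>

/* True if `cp` lies in one of the three Unicode Private Use ranges. */
int is_private_use(unsigned long cp)
{
    return (cp >= 0xE000UL && cp <= 0xF8FFUL)        /* BMP PUA  */
        || (cp >= 0xF0000UL && cp <= 0xFFFFDUL)      /* plane 15 */
        || (cp >= 0x100000UL && cp <= 0x10FFFDUL);   /* plane 16 */
}

/*
 * Encode `cp` as UTF-8 into `buf`, which must hold at least 4 bytes.
 * Returns the number of bytes written, or 0 if `cp` is not encodable
 * (a surrogate or a value above U+10FFFF).
 */
size_t utf8_encode(unsigned long cp, unsigned char *buf)
{
    assert(buf != NULL);

    if (cp < 0x80UL) {
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {
        if (cp >= 0xD800UL && cp <= 0xDFFFUL)
            return 0;               /* surrogates are not encodable */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFFUL) {
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

A character assigned to, say, U+E000 is just another three-byte
sequence to any UTF-8 aware tool; only the font and the private
agreement about its meaning are yours to maintain.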

> > Displaying glyphs from right to left is not bidi, it's just a
> > different layout.  Bidi is about _bidirectional_ display, where the
> > direction changes from L2R to R2L and back within the same paragraph
> > of text.
> Without thinking (calculating) how it gets rendered, it's really
> impossible to perform the kinsoku operation

What does kinsoku have to do with Gawk?  Gawk doesn't break/wrap lines
on its own, it does what the script tells it to do.

> repeated-letter substitution

Details (or pointers), please.  I don't know what you mean by this.

> But at least even try to build a system for our language. Then you know.
> If you already have such a system that works nicely in our language,
> tell me, I want to evaluate. I am not being sarcastic or irrational.
> If such nice working system exists, I really want to use it now.

Try Emacs.

> In an ideal world, the render layer and string manipulation layer
> are separate. I wish I could just handle multi-byte characters like ASCII.
> But in reality I cannot. I _can_, but without heavy lifting, the
> outcome is very terrible, in both the character order,
> _and_ when rendering
> (some things must be handled before passing it upstream to render layer).

These are high-level slogans.  Please tell the details or show
examples to make your point clear.

I see no reason why processing multi-byte text should be impossible
without involving rendering levels.  In Emacs, we do have that
separation, so it _is_ possible, and not just for ASCII.

> Because if gawk is at character stream mode, we have no good / sane
> way to output specific byte at the moment (v4.1.60). We used to have
> such capability in the past (v3.2+patch).

It is still unclear to me why you would need that.  I suspect that is
because you still try to hack around multi-byte code, instead of
adopting it and "going with the current".  Use PUA when the existing
Unicode characters are not enough.

But if you insist on having such byte-stream I/O, you can have it as a
Gawk extension, as was suggested to you.

> if you knew the problem, it would be nice to share the solution with
> us, and the rest of the world if you do not hesitate.

Look at Emacs, it is one of the most complete solutions to these
problems I know of.

> > 'it[gawk] doesn't know how to break them into characters'
> if we run gawk with hands_off_my_data flag, we can reuse scripts written
> in the past.

Those scripts are broken.  You will be better off tossing them, and
instead using the multi-byte approach.  It will do you good, and your
language support will be better and more complete.  There's no need to
use 20-year-old hacks today.  Things have changed for the better;
there are many features available today that weren't back then.  Don't
cling to the past.

> Can I not write extensions that call other libraries?

You can.

> In particular, being able to call iconv, nkf is nice.
> But not all systems have it. I can use iconv most of the times, but
> on some occasions, the fine grain control of nkf is nice to have.
> 
> Something like
> Pseudo code
> if( some_way_to_check_if_libs_are_available( nkf ) ){
>   @load( nkf );
> }else{
>   if( some_way_to_check_if_libs_are_available( iconv ) ){
>     @load( iconv );
>   }
>   # maybe use a stripped down implementation written in awk script
>   @include( 'my_gawk_script_to_manipulate_strings.awk' );
> }
> from within awk script would be a nice thing to have.

Write an extension that can test whether a library is present, and you
can have this.
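
The core of such a test is small.  Below is a minimal sketch, assuming
a POSIX system with dlopen(); library_available() is a made-up name,
and the glue that would expose it to Awk code as an extension function
is deliberately left out.  On older glibc versions this needs linking
with -ldl.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/*
 * Return 1 if the dynamic linker can load the shared library `soname`
 * (for example "libfoo.so.1"), 0 otherwise.  The library is unloaded
 * again right away; this only answers "is it present and loadable?".
 * library_available() is a hypothetical helper, not a Gawk API.
 */
int library_available(const char *soname)
{
    assert(soname != NULL);

    void *handle = dlopen(soname, RTLD_LAZY | RTLD_LOCAL);
    if (handle == NULL)
        return 0;                   /* not found, or not loadable */

    dlclose(handle);
    return 1;
}
```

The exact file name to probe for is system-dependent, so a script that
uses such a test would have to know the library names on its platform.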

> >> 3) Feasibility of having something similar to libffi, so other things
> >> can be loaded
> >> without wrapper.
> >> We have lots of conversion routines, and if we can specify the calling
> >> convention
> >> from within the script, it would be very nice.
> >
> > That's what the dynamic loading feature in Gawk is all about.  Just
> > use it; no need for libffi.
> Humm... I will continue to look into it.
> Just a question. Have you used libffi before ?

Yes, I've used libffi.  Gawk's extension support is supposed to
eliminate the need for that.

> the samples in ./extension/ suggest that I must write lots of wrappers.
> 
> So
>   script <-@load()-> gawk <--gawkapi--> wrapper <-std c api-> iconv
> and such.
> 
> if something equivalent to libffi were available, it could be
> like
>   script <-@load()-> gawk <--std c api--> iconv
> and implement things in script.
> being able to directly load/unload gives me more power.

Look at the examples of extensions, they will show you how easy it is
to write one.  Awk is a very simple language, with a small number of
data types, so nothing as complex as FFI is needed.

> Interpreters that have this capability (in contrast to having to write
> wrappers for every and all libs one needs) allow more flexibility.
> 
> Many have it under name of loadlib or such. Lua, matlab, to name a few.

Lua and Matlab are much more complex systems.

Anyway, just look at the Gawk extension examples.

> If I write a libffi wrapper in gawkapi, problem is solved.

You don't need that.

> when I must print a byte, that is in the range of 0x80 - 0xff the
> known way is buggy and unreliable.

I don't understand why, but you can always write a replacement fprintf
for that.
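
For what it's worth, a replacement along those lines can be very small.
This is a sketch only, with made-up function names: it writes bytes
through stdio without any locale-dependent conversion, so values in the
0x80 - 0xff range go out unchanged.  On systems that distinguish text
and binary streams, the stream must be opened in binary mode.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write `n` raw bytes to `fp` with no encoding conversion.
 * Returns the number of bytes actually written.
 * Hypothetical helper, not part of Gawk.
 */
size_t write_raw_bytes(FILE *fp, const unsigned char *bytes, size_t n)
{
    assert(fp != NULL && (bytes != NULL || n == 0));
    return fwrite(bytes, 1, n, fp);
}

/*
 * Write a single byte given as a number 0..255.
 * Returns 0 on success, -1 if `value` is out of range or on I/O error.
 */
int write_raw_byte(FILE *fp, unsigned int value)
{
    assert(fp != NULL);

    if (value > 0xFFu)
        return -1;
    return fputc((int)value, fp) == EOF ? -1 : 0;
}
```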



