bug-gawk

Re: [bug-gawk] gawk 4.x series mmap attempts to alocates 32GB of memory


From: Eli Zaretskii
Subject: Re: [bug-gawk] gawk 4.x series mmap attempts to alocates 32GB of memory ...
Date: Sat, 12 Jul 2014 14:32:51 +0300

> Date: Sat, 12 Jul 2014 19:47:12 +0900
> From: green fox <address@hidden>

Why private mail?  I've CC'ed the list again.

> I use a language that needs multiple byte to represent correctly.

Which language is that?  And which multibyte representation do you
want to support in Gawk for that language?  Please be specific; these
are very technical issues that cannot be discussed on such a high
level, without any details.

> The language is _very_ sensitive in how it must be presented.
> Location,order, width, height, place of dot, where to split line, is
> all very important.

These are all very important in many languages, I agree.  But they are
not, and should not be, Gawk problems.  Gawk is a text processing
tool, so it processes text in the logical order, disregarding any
display-time features.  If you are looking for solutions for
display-time problems, Gawk is not the right place to look.

> And, my country has cities and places, and name of people who are alive,
> where the letter is not included in Unicode.
> The situation was very very bad. My country made there own code page to
> solve the problem.

What codepage is that?  Does it have a name or a number?  Is it
supported by any popular system out there, and if so, which systems
support it?

> For America/Europe, Multibyte, CJK, bidi, is all Render issue.
> For me and my country, it is all REAL day-to-day problems on handling text.
> 
> bidi is very important too. It is not only Arabic. Old and new
> Chinese, Japanese, we write in right-to-left, and in some cases,
> up-to-down as well.

Displaying glyphs from right to left is not bidi, it's just a
different layout.  Bidi is about _bidirectional_ display, where the
direction changes from L2R to R2L and back within the same paragraph
of text.

And it is still about display, so it's unrelated to Gawk.
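
To make the distinction concrete, here is a minimal sketch in Python
(not Gawk code).  It assumes that "bidi" can be approximated as "the
text contains strong characters of both directions"; the real Unicode
Bidirectional Algorithm is far more involved.  The helper names and
the sample strings are made up for illustration.

```python
import unicodedata

# Strong directional classes: L (left-to-right), R and AL (right-to-left).
STRONG = {"L", "R", "AL"}

def strong_directions(text):
    """Return the set of strong directional classes present in text."""
    return {unicodedata.bidirectional(ch) for ch in text} & STRONG

def is_bidi_text(text):
    """True only if the text mixes left-to-right and right-to-left characters."""
    dirs = strong_directions(text)
    return "L" in dirs and bool(dirs & {"R", "AL"})

# Both strings are stored in logical order; nothing here touches display.
mixed = "abc \u05d0\u05d1\u05d2 def"   # Latin, Hebrew, Latin
rtl_only = "\u05d0\u05d1\u05d2"        # Hebrew only: right-to-left, but not bidi

print(is_bidi_text(mixed), is_bidi_text(rtl_only))
```

The point of the sketch is that a purely right-to-left string is not
bidirectional, and that in both cases the stored text stays in logical
order; reordering for display is left to whatever program renders it.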

> Where to split, or insert data? We check type of character, calculate
> length, or lookup for next character with matching type.

I understand all that (and knew about it before), but what does this
have to do with Gawk?  Gawk doesn't split words, or calculate their
width or position on display, or consider any other display layout
issues.  Gawk _produces_ text, which some other piece of software (a
text terminal or emulator, a GUI rendering program, etc.) should then
present to the user.  It is that other software's job to select the
correct font and glyphs, reorder the text for display, be it bidi or
otherwise, and display it so that the result is legible for users who
speak that language.

Gawk has nothing to do with this, and cannot do anything about that.
Gawk's sole purpose is to process text.  For that, it does need to
know how to break text into characters and sometimes (while processing
regular expressions) into words.  But that's it.

> Please, take back your statement for the language you do not know about.

I'm not taking back anything.  I already knew everything you
described; there's nothing new here for me, at least not at the level
at which you presented it.

You are not the only one who understands these issues, or who has ever
worked on them.

> >   . what problem(s), exactly, do you want to solve?
> Easier handling of Multi byte characters.
> Meaning, If absolute bare minimum necessary, just enough so
>  I can calculate code point by my self and print necessary byte stream to disk

Not sure what that means.  You cannot possibly have this if Gawk does
not understand the multibyte encoding of your language, because
there's very little Gawk can do with bytes if it doesn't know how to
break them into characters.  You will have a very crippled, perhaps
even unusable, Gawk.  This "bare minimum" makes very little sense to
me.
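
As an illustration of why bytes alone are not enough, here is a minimal
sketch in Python (not Gawk code).  The sample string and the two
encodings are arbitrary choices made for this example, not something
taken from this thread, and the helper name is made up.

```python
def count_utf8_chars(data: bytes) -> int:
    """Count characters in UTF-8 data by skipping continuation bytes."""
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return sum(1 for b in data if (b & 0xC0) != 0x80)

text = "\u65e5\u672c\u8a9e"            # three CJK characters (arbitrary sample)

utf8_bytes = text.encode("utf-8")      # 3 bytes per character here
sjis_bytes = text.encode("shift_jis")  # 2 bytes per character here

byte_count = len(utf8_bytes)               # what a byte-oriented tool sees
char_count = count_utf8_chars(utf8_bytes)  # what an encoding-aware tool sees
sjis_count = len(sjis_bytes)               # same text, different byte length

print(byte_count, char_count, sjis_count)
```

The same three characters occupy a different number of bytes in each
encoding, so a byte offset or byte length says nothing about
characters unless the program knows which encoding it is reading.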

> >   . what solution(s) do you propose for that?
> Not solution. Basic least intrusive addition to gawk.
> Will be used as building block for complex features we need.

Please elaborate: what exactly do you want Gawk to be able to do that
it cannot do now?

> >   . why do you think that having those solutions as loadable
> >     extensions (which are always distributed and installed with Gawk)
> >     is not TRT?
> I have pondered on the gawk extension idea myself for a while.
> In general, the idea itself is very good. And hard work has been put in.
> 
> The problem that needs ironing out is
> 1) How to detect if lib is missing / handle gracefully when function
> is not available.

Why would you need that, if the library is _always_ present?  It is
part of the Gawk distribution.

And assuming that the library _is_ missing -- what would your Gawk
script do in that case, except abort (something Gawk already does)?

> 2) Environmental difference. MSWIN/UNIX the awk script works nicely if
>     taken care.
>     Needs much tougher deep thinking and planning on deploy using libs.
>     I do not have a good solution to this problem yet.

Gawk already puts the libraries where it will automatically find them,
both on Windows and on Posix systems.  So I'm not sure which problem
you are alluding to here.

> 3) Feasibility of having something similar to libffi, so other things
> can be loaded without wrapper.
> We have lots of conversion routines, and if we can specify the calling
> convention from within the script, it would be very nice.

That's what the dynamic loading feature in Gawk is all about.  Just
use it; no need for libffi.

> The minimum capability I need was read / write of binary data.

It was suggested that you write an extension to do that.  If you think
this would be impossible with an extension, please explain in detail
why you think so.  I don't see any obstacles here.
