Re: [bug-gawk] Exposing wcwidth(3) as a built-in function


From: Eric Pruitt
Subject: Re: [bug-gawk] Exposing wcwidth(3) as a built-in function
Date: Fri, 8 Dec 2017 15:25:34 -0800
User-agent: NeoMutt/20170113 (1.7.2)

On Fri, Dec 01, 2017 at 09:41:14PM +0200, Eli Zaretskii wrote:
> > I decided to make a less hacky, portable version of the wcwidth
> > function. The rewritten script is much smaller, and it doesn't kill the
> > MAWK parser like its predecessor. Lookups are done using a binary search
> > on a table that is lazy-loaded at runtime. I've attached the updated
> > script to this email, but the canonical repository is
> > https://github.com/ericpruitt/wcwidth.awk .
>
> Thanks, but doesn't this still assume UTF-8 encoding of characters?
> If so, it's not portable to non-UTF-8 locales, right?

I realized I may have misinterpreted your question, so I will clarify and
add a question of my own. Only the code for interpreters that are not
multi-byte safe falls back to manual UTF-8 parsing. In GAWK, the lookup
table uses lexical comparisons and assumes the locale is multi-byte
safe. In MAWK, however, the lookup table was* indexed by numerical code
points. Are there multi-byte locales where I could not count on
sprintf("%c", 23485) being "宽" in GNU Awk? Judging from the output of
"fgrep -ir iconv --include '*.h' --include '*.c'", GAWK does not appear
to use iconv. Perhaps a more accurate question is: will GAWK work on
platforms that do not have **any** Unicode support (be it UTF-8, UTF-16,
etc.)?
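
To make the "manual UTF-8 parsing" fallback concrete, here is a minimal sketch of what such parsing involves. It is written in Python rather than AWK for brevity, and the function names utf8_encode and utf8_decode_first are hypothetical; they are not taken from wcwidth.awk. The sketch does not reject overlong encodings or surrogates.

```python
def utf8_encode(cp):
    """Encode one code point as UTF-8 bytes by hand (illustrative only)."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf8_decode_first(data):
    """Return (code_point, byte_length) for the first character in data."""
    b0 = data[0]
    if b0 < 0x80:                      # single-byte (ASCII) character
        return b0, 1
    if b0 >> 5 == 0b110:               # lead byte of a 2-byte sequence
        n, cp = 2, b0 & 0x1F
    elif b0 >> 4 == 0b1110:            # lead byte of a 3-byte sequence
        n, cp = 3, b0 & 0x0F
    elif b0 >> 3 == 0b11110:           # lead byte of a 4-byte sequence
        n, cp = 4, b0 & 0x07
    else:
        raise ValueError("invalid UTF-8 lead byte")
    if len(data) < n:
        raise ValueError("truncated UTF-8 sequence")
    for b in data[1:n]:
        if b >> 6 != 0b10:             # continuation bytes look like 10xxxxxx
            raise ValueError("invalid UTF-8 continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp, n
```

In a UTF-8 environment, code point 23485 (U+5BBD) corresponds to the three bytes E5 AE BD, so utf8_decode_first("宽".encode("utf-8")) yields (23485, 3).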

* For performance reasons, I have since rewritten the code for
  multi-byte unsafe interpreters so the lookup table is indexed by UTF-8
  byte strings instead of numeric code points.

Eric


