Re: [Chicken-users] UTF-8 support in eggs

On Fri, Jul 11, 2014 at 7:20 AM, Oleg Kolosov <address@hidden> wrote:

On 07/09/14 09:00, Alex Shinn wrote:
> The clean way to handle this is to duplicate the useful string
> APIs for bytevectors. This could be done without code duplication
> with the use of functors, though compiler assistance may be
> needed for efficiency (e.g. for inlined procedures). Even without
> code duplication there would be an increase in the core library
> size, though we could probably move most utilities to external
> libraries (how often do you need regexps that operate on binary
> data?).

Considering the Chibi Scheme size numbers from your other mail, I would
hardly call this a huge price for the benefit received, even for my
specific embedded use cases.

Note Chibi factors out all but a few string utilities into
separate libraries, i.e. the Chibi core is smaller than the
Chicken core. The size increase for Chicken would thus
be correspondingly larger, though still likely very small.

> The bigger issue from the performance perspective is existing
> idioms that use indexes, which can degrade to quadratic behavior
> in the worst case no matter how much you optimize (without hacks
> that slow down normal usage). So people would have to learn to
> take substrings where appropriate to avoid the start/end parameters
> to all SRFI 13 functions, or we would need to deprecate SRFI 13
> in favor of a cursor-oriented API (planned for R7RS).

Do you have some examples of how to avoid performance degradation
without using string indexes?

Just don't use string indexes - they're not useful. Passing
and returning cursors (byte offsets into strings) is all you need. [*]
In the more common cases, just using string ports, string-map,
or loop syntax hides the underlying iteration (a good loop macro
has potential to be faster than manual iteration).
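
To make the index vs. cursor difference concrete, here is a minimal
sketch. It is written in Python purely as a neutral stand-in for
Scheme, and every helper name in it is hypothetical (none of them is a
Chicken, Chibi, or SRFI 13 procedure). It assumes the string is held as
valid UTF-8 bytes: looking up a character by index has to rescan from
the start, while a cursor is just a byte offset that moves forward one
character at a time.

```python
# Illustrative sketch only: Python stands in for Scheme, and all names
# below are hypothetical helpers, not part of any real string library.

def cursor_next(buf: bytes, cursor: int) -> int:
    """Advance a cursor (byte offset) past one UTF-8 encoded character."""
    cursor += 1
    # UTF-8 continuation bytes always have the bit pattern 10xxxxxx.
    while cursor < len(buf) and (buf[cursor] & 0xC0) == 0x80:
        cursor += 1
    return cursor


def index_to_offset(buf: bytes, index: int) -> int:
    """Translate a character index to a byte offset by scanning from 0.

    This costs O(index), which is what makes index-based loops quadratic.
    """
    offset = 0
    for _ in range(index):
        offset = cursor_next(buf, offset)
    return offset


def char_at(buf: bytes, cursor: int) -> str:
    """Decode the single character that starts at the given cursor."""
    return buf[cursor:cursor_next(buf, cursor)].decode("utf-8")


def chars_by_index(buf: bytes, n_chars: int) -> list:
    """Index idiom: every lookup rescans from the start (quadratic overall)."""
    return [char_at(buf, index_to_offset(buf, i)) for i in range(n_chars)]


def chars_by_cursor(buf: bytes) -> list:
    """Cursor idiom: one forward pass over the bytes (linear overall)."""
    out = []
    cursor = 0
    while cursor < len(buf):
        out.append(char_at(buf, cursor))
        cursor = cursor_next(buf, cursor)
    return out
```

Both functions produce the same characters; only the cursor version
avoids re-walking the prefix on each step. Taking a substring once and
then iterating over it gets the same effect for start/end style APIs.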

How about more complex formatting like outputting numbers with
padding? I guess these should be handled with something like fmt
(or chibi.show).

Well, this is completely orthogonal to utf8, but probably the
most important performance hack for combinator formatters
is Chicken's define-compiler-syntax.

Alex

[*] With very few exceptions, the only one of which I'm aware being
Boyer-Moore. However, string search on utf8 bytes is faster than
on UTF-32 code points, so the trick is to just provide string search as
part of an API and let implementations optimize accordingly.
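
As a rough illustration of why searching the raw bytes is safe: UTF-8
lead bytes never look like continuation bytes, so a byte-level match of
a valid, non-empty UTF-8 needle in valid UTF-8 text always starts on a
character boundary, and the resulting byte offset can be handed back
directly as a cursor. The sketch below is Python used as a stand-in,
`utf8_search` is a hypothetical name, and `bytes.find` merely stands in
for whatever optimized search an implementation chooses to provide.

```python
# Illustrative sketch only; utf8_search is a hypothetical name.

def utf8_search(haystack: bytes, needle: bytes) -> int:
    """Return the byte offset (usable as a cursor) of the first match, or -1.

    Assumes both arguments are valid UTF-8 and the needle is non-empty.
    No decoding to code points is needed before searching.
    """
    return haystack.find(needle)


haystack = "日本語のテキスト".encode("utf-8")
needle = "テキ".encode("utf-8")
cursor = utf8_search(haystack, needle)

# The offset is a valid cursor: decoding from it yields whole characters.
rest = haystack[cursor:].decode("utf-8")
```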

From:	Alex Shinn
Subject:	Re: [Chicken-users] UTF-8 support in eggs
Date:	Fri, 11 Jul 2014 10:28:46 +0900