Re: [Chicken-users] UTF-8 support in eggs


From: Alex Shinn
Subject: Re: [Chicken-users] UTF-8 support in eggs
Date: Fri, 11 Jul 2014 10:28:46 +0900

On Fri, Jul 11, 2014 at 7:20 AM, Oleg Kolosov <address@hidden> wrote:
On 07/09/14 09:00, Alex Shinn wrote:
> The clean way to handle this is to duplicate the useful string
> APIs for bytevectors.  This could be done without code duplication
> with the use of functors, though compiler assistance may be
> needed for efficiency (e.g. for inlined procedures).  Even without
> code duplication there would be an increase in the core library
> size, though we could probably move most utilities to external
> libraries (how often do you need regexps that operate on binary
> data?).

Considering the Chibi Scheme size numbers from your other mail, I would
hardly call this a huge price for the benefit received, even for my
specific embedded use cases.

Note Chibi factors out all but a few string utilities into
separate libraries, i.e. the Chibi core is smaller than the
Chicken core.  The size increase for Chicken would thus
be correspondingly larger, though still likely very small.

> The bigger issue from the performance perspective is existing
> idioms that use indexes, which can degrade to quadratic behavior
> in the worst case no matter how much you optimize (without hacks
> that slow down normal usage).  So people would have to learn to
> take substrings where appropriate to avoid the start/end parameters
> to all SRFI 13 functions, or we would need to deprecate SRFI 13
> in favor of a cursor-oriented API (planned for R7RS).

Do you have some examples of how to avoid performance degradation
without using string indexes?

Just don't use string indexes - they're not useful.  Passing
and returning cursors (byte offsets into strings) is all you need. [*]
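
To make the cursor idea concrete, here is a rough sketch in portable
R7RS Scheme. It assumes the string is held as a UTF-8 bytevector; the
names continuation-byte?, cursor-next, index->cursor and utf8-length
are hypothetical helpers for illustration, not part of any existing
egg or of the planned R7RS API.

```scheme
(import (scheme base))

;; Hypothetical helpers for illustration only.
;; A cursor is a byte offset into a UTF-8 encoded bytevector.

;; True if byte B is a UTF-8 continuation byte (binary 10xxxxxx).
(define (continuation-byte? b)
  (and (>= b #x80) (< b #xC0)))

;; Advance CUR past one encoded character (at most 4 bytes of work).
(define (cursor-next bv cur)
  (let loop ((i (+ cur 1)))
    (if (and (< i (bytevector-length bv))
             (continuation-byte? (bytevector-u8-ref bv i)))
        (loop (+ i 1))
        i)))

;; Converting a character index to a cursor has to scan from the
;; start, so it costs time proportional to INDEX.
(define (index->cursor bv index)
  (let loop ((cur 0) (n index))
    (if (= n 0)
        cur
        (loop (cursor-next bv cur) (- n 1)))))

;; Counting characters by stepping a cursor is a single linear pass.
(define (utf8-length bv)
  (let loop ((cur 0) (n 0))
    (if (>= cur (bytevector-length bv))
        n
        (loop (cursor-next bv cur) (+ n 1)))))

;; The UTF-8 bytes of "naive" with a diaeresis on the i (C3 AF).
(define bv (bytevector #x6E #x61 #xC3 #xAF #x76 #x65))

(utf8-length bv)       ; => 5 characters in 6 bytes
(index->cursor bv 3)   ; => 4, the byte offset of #\v
```

A loop that calls index->cursor on every iteration rescans the prefix
each time, which is where the quadratic behavior comes from; a loop
that keeps the cursor and calls cursor-next stays linear.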

In the more common cases, just using string ports, string-map,
or loop syntax hides the underlying iteration (a good loop macro
has potential to be faster than manual iteration).
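
As a small illustration of the port-based style, here is a sketch
that never mentions an index. The procedure squeeze-whitespace is a
made-up example, and the imports assume an R7RS environment (in
Chicken the same procedures are available from the core and SRFI 13).

```scheme
(import (scheme base) (scheme char))

;; Collapse each run of whitespace into a single space, reading and
;; writing through string ports so no index ever appears.
(define (squeeze-whitespace str)
  (let ((in (open-input-string str))
        (out (open-output-string)))
    (let loop ((c (read-char in)) (in-space? #f))
      (cond ((eof-object? c)
             (get-output-string out))
            ((char-whitespace? c)
             (unless in-space? (write-char #\space out))
             (loop (read-char in) #t))
            (else
             (write-char c out)
             (loop (read-char in) #f))))))

(squeeze-whitespace "a  b \t c")   ; => "a b c"

;; string-map likewise hides the iteration entirely.
(string-map char-upcase "abc")     ; => "ABC"
```

Because the loop only ever asks for the next character, it costs the
same whether the underlying string is stored as fixed-width
codepoints or as utf8.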

How about more complex formatting like
outputting numbers with padding? I guess these should be handled with
something like fmt (or chibi.show).

Well, this is completely orthogonal to utf8, but probably the
most important performance hack for combinator formatters
is Chicken's define-compiler-syntax.

-- 
Alex

[*] With very few exceptions; the only one I'm aware of is
Boyer-Moore. However, string search on utf8 bytes is faster than
on UTF-32 codepoints, so the trick is to just provide string search as
part of an API and let implementations optimize accordingly.
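
A sketch of what searching directly on the bytes can look like, in
portable R7RS Scheme. The name utf8-search is hypothetical, and this
uses a naive scan rather than Boyer-Moore to keep it short. Since
utf8 is self-synchronizing, a match for a valid, non-empty needle
inside valid utf8 always starts on a character boundary, so the
result can be used as a cursor without any decoding.

```scheme
(import (scheme base))

;; Hypothetical helper: return the cursor (byte offset) of the first
;; occurrence of NEEDLE in HAYSTACK, both UTF-8 bytevectors, or #f.
;; Naive scan for brevity; an implementation could substitute
;; Boyer-Moore or similar behind the same interface.
(define (utf8-search haystack needle)
  (let ((hlen (bytevector-length haystack))
        (nlen (bytevector-length needle)))
    (let outer ((start 0))
      (if (> (+ start nlen) hlen)
          #f
          (let inner ((k 0))
            (cond ((= k nlen) start)
                  ((= (bytevector-u8-ref haystack (+ start k))
                      (bytevector-u8-ref needle k))
                   (inner (+ k 1)))
                  (else (outer (+ start 1)))))))))

;; The UTF-8 bytes of "naive" with a diaeresis on the i (C3 AF).
(define haystack (bytevector #x6E #x61 #xC3 #xAF #x76 #x65))

(utf8-search haystack (string->utf8 "ve"))   ; => 4
(utf8-search haystack (bytevector #xC3 #xAF)) ; => 2
(utf8-search haystack (string->utf8 "xy"))   ; => #f
```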
