Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.


From: Alex Shinn
Subject: Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.
Date: Sat, 26 Jan 2013 13:42:13 +0900

On Wed, Jan 23, 2013 at 5:09 PM, Alex Shinn <address@hidden> wrote:
> On Wed, Jan 23, 2013 at 3:45 PM, Ivan Raikov <address@hidden> wrote:
>> Yes, I ran into this when I was adding UTF-8 support to mbox... If you were to add wide char support in srfi-14, is there a way to quantify the performance penalty?
>
> To add the bounds check so it doesn't error?  Practically
> nothing.
>
> To branch to a separate path for a wide-char table if
> the bounds check fails?  Same cost if the input is ASCII.
>
> For efficient handling in the case of Unicode input...
> how small/fast do you want it?

I've never met such stony silence in response to an offer to do work...

I ran the following simple char-set-contains? benchmark with
a few variations:

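  ;; Assumes a char-set library is already loaded, e.g. (use srfi-14)
  ;; or (use utf8-srfi-14) on Chicken 4.
  ;; Runs 10000 passes over the 256 latin-1 code points.
  (time
   (do ((i 0 (+ i 1)))
       ((= i 10000))
     (do ((j 0 (+ j 1)))
         ((= j 256))
       (char-set-contains? char-set:letter (integer->char j)))))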

Membership testing is what most people care about for speed,
since the boolean (set algebra) and construction operations
are less common.

The results:

;; reference implementation
;; 0.312s CPU time, 1/2059 GCs (major/minor)

;; "fixed" reference implementation (no error but no support for non-latin-1)
;; 0.257s CPU time, 1/1706 GCs (major/minor)

;; utf8-srfi-14 with full Unicode char-set:letter
;; 0.243s CPU time, 0/1526 GCs (major/minor)

;; utf8-srfi-14 with ASCII-only char-set:letter
;; 0.242s CPU time, 0/1526 GCs (major/minor)

I was able to add the bounds check and still make the
reference implementation faster, because I also fixed the
common case: the lookup was optimized for checking for 0
instead of 1.
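
For illustration, the bounds check might look roughly like the
following. This is a minimal sketch, not the actual srfi-14 source:
the vector-of-booleans table and the names make-latin-1-table,
ascii-letter? and table-contains? are all made up for the example.

```scheme
;; Hypothetical sketch of a bounds-checked membership test.
;; None of these names come from srfi-14 or utf8-srfi-14.

;; ASCII-only stand-in for a real "letter" predicate.
(define (ascii-letter? ch)
  (let ((n (char->integer ch)))
    (or (and (>= n 65) (<= n 90))      ; A .. Z
        (and (>= n 97) (<= n 122)))))  ; a .. z

;; Build a 256-entry table with #t wherever pred holds.
(define (make-latin-1-table pred)
  (let ((table (make-vector 256 #f)))
    (do ((i 0 (+ i 1)))
        ((= i 256) table)
      (if (pred (integer->char i))
          (vector-set! table i #t)))))

(define ascii-letters (make-latin-1-table ascii-letter?))

;; The bounds check: a char outside the table answers #f
;; instead of indexing past the end and raising an error.
(define (table-contains? table ch)
  (let ((n (char->integer ch)))
    (and (< n 256)
         (vector-ref table n))))
```

Any char at or above 256 simply answers #f here, which corresponds
to the "fixed" variant in the results above: no error, but no
support for non-latin-1 either.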

Even with the enormous and complex definition of a
Unicode "letter", utf8-srfi-14 is faster than srfi-14.

As for what we want in Chicken, the answer depends
on what you're optimizing for.  utf8-srfi-14 will always
win for space, and generally for speed as well.

If the biggest concern is code-size, then you might want
to borrow the char-set definition from irregex and use
that as a "fallback" for non-latin-1 chars in the srfi-14
reference impl.  This would have the same perf as
srfi-14 for latin-1, yet still support full Unicode and not
increase the size of the Chicken distribution.
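
A rough sketch of that fallback idea, under loud assumptions: the
non-latin-1 part is stored here as a sorted vector of inclusive
code-point ranges searched by binary search, which may or may not
match how irregex represents its char-sets, and the three ranges
are sample data only, not a real definition of "letter". All names
are hypothetical.

```scheme
;; Hypothetical sketch: latin-1 table with a range-based fallback.

;; Sample data only: sorted, non-overlapping inclusive ranges.
(define wide-letter-ranges
  (vector (cons 913 929)      ; Greek capital Alpha .. Rho
          (cons 945 969)      ; Greek small alpha .. omega
          (cons 1040 1103)))  ; Cyrillic A .. ya

;; Binary search over the sorted ranges.
(define (ranges-contain? ranges n)
  (let loop ((lo 0) (hi (- (vector-length ranges) 1)))
    (and (<= lo hi)
         (let* ((mid (quotient (+ lo hi) 2))
                (range (vector-ref ranges mid)))
           (cond ((< n (car range)) (loop lo (- mid 1)))
                 ((> n (cdr range)) (loop (+ mid 1) hi))
                 (else #t))))))

;; ASCII-only latin-1 table, built inline to keep the sketch short.
(define latin-1-letters
  (let ((table (make-vector 256 #f)))
    (do ((i 65 (+ i 1))) ((> i 90))
      (vector-set! table i #t)         ; A .. Z
      (vector-set! table (+ i 32) #t)) ; a .. z
    table))

;; Fast path for latin-1, fallback only when the bounds check fails.
(define (letter? ch)
  (let ((n (char->integer ch)))
    (if (< n 256)
        (vector-ref latin-1-letters n)
        (ranges-contain? wide-letter-ranges n))))
```

Latin-1 input never reaches ranges-contain?, so its cost is the
table lookup plus one comparison; only wide chars pay for the
search.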

-- 
Alex

