Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

On Wed, Jan 23, 2013 at 5:09 PM, Alex Shinn <address@hidden> wrote:

On Wed, Jan 23, 2013 at 3:45 PM, Ivan Raikov <address@hidden> wrote:

Yes, I ran into this when I was adding UTF-8 support to mbox... If you were to add wide char support in srfi-14, is there a way to quantify the performance penalty?

To add the bounds check so it doesn't error? Practically nothing.

To branch to a separate path for a wide-char table if the bounds check fails? Same cost if the input is ASCII.

For efficient handling in the case of Unicode input... how small/fast do you want it?
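
To make the first two options concrete, here is a minimal sketch. It assumes the set is stored as a 256-entry table indexed by code point; the procedure names and the vector representation are hypothetical and are not taken from the srfi-14 reference source.

```scheme
;; Hypothetical sketch; not the srfi-14 reference source.
;; table: 256-element vector of booleans indexed by code point.
;; wide-contains?: procedure taking a code point, called only above 255.
(define (contains/bounds-check table wide-contains? c)
  (let ((n (char->integer c)))
    (if (< n 256)                 ; the added bounds check
        (vector-ref table n)      ; unchanged fast path for Latin-1 input
        (wide-contains? n))))     ; separate path, never taken for ASCII

;; "Fixed" variant: no error, but no support beyond Latin-1.
(define (contains/latin-1-only table c)
  (contains/bounds-check table (lambda (n) #f) c))
```

For ASCII input the only extra work is one integer comparison, which is why the cost should be close to nothing.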

I've never met such stony silence in response to an offer to do work...

I ran the following simple char-set-contains? benchmark with a few variations:

(time
 (do ((i 0 (+ i 1)))
     ((= i 10000))
   (do ((j 0 (+ j 1)))
       ((= j 256))
     (char-set-contains? char-set:letter (integer->char j)))))

This is what most people are concerned about for speed, as the boolean and construction operations are less common.

The results:

;; reference implementation
;;   0.312s CPU time, 1/2059 GCs (major/minor)

;; "fixed" reference implementation (no error but no support for non-latin-1)
;;   0.257s CPU time, 1/1706 GCs (major/minor)

;; utf8-srfi-14 with full Unicode char-set:letter
;;   0.243s CPU time, 0/1526 GCs (major/minor)

;; utf8-srfi-14 with ASCII-only char-set:letter
;;   0.242s CPU time, 0/1526 GCs (major/minor)

I was able to add the check and still make the reference implementation faster because I fixed the common case: it was optimized for checking for 0 instead of 1.

Even with the enormous and complex definition of a Unicode "letter", utf8-srfi-14 is faster than srfi-14.

As for what we want in Chicken, the answer depends on what you're optimizing for. utf8-srfi-14 will always win for space, and generally for speed as well.

If the biggest concern is code size, then you might want to borrow the char-set definition from irregex and use that as a "fallback" for non-latin-1 chars in the srfi-14 reference implementation. This would have the same performance as srfi-14 for latin-1, yet still support full Unicode and not increase the size of the Chicken distribution.
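
A rough sketch of that fallback arrangement follows. It assumes the non-latin-1 part of a set is a sorted vector of inclusive code-point ranges searched by binary search; that layout, the procedure names, and the toy data are all assumptions made for illustration, and irregex's actual representation may differ.

```scheme
;; Hypothetical sketch: names and data layout are assumptions, not taken
;; from irregex or the srfi-14 reference implementation.

;; ranges: vector of (lo . hi) pairs of code points, inclusive, sorted
;; ascending and non-overlapping. Binary search keeps lookup O(log n).
(define (ranges-contain? ranges n)
  (let loop ((lo 0) (hi (- (vector-length ranges) 1)))
    (and (<= lo hi)
         (let* ((mid (quotient (+ lo hi) 2))
                (range (vector-ref ranges mid)))
           (cond ((< n (car range)) (loop lo (- mid 1)))
                 ((> n (cdr range)) (loop (+ mid 1) hi))
                 (else #t))))))

;; latin-1: 256-element vector of booleans (the existing fast path).
(define (contains/range-fallback latin-1 ranges c)
  (let ((n (char->integer c)))
    (if (< n 256)
        (vector-ref latin-1 n)
        (ranges-contain? ranges n))))

;; Toy data: ASCII letters in the table, two Greek blocks as the fallback.
;; These ranges are illustrative only, not a real Unicode "letter" definition.
(define toy-latin-1
  (let ((v (make-vector 256 #f)))
    (do ((i 0 (+ i 1)))
        ((= i 256) v)
      (vector-set! v i (or (<= 65 i 90) (<= 97 i 122))))))

(define toy-ranges (vector (cons #x391 #x3A9) (cons #x3B1 #x3C9)))
```

Latin-1 lookups never reach the range search, so the latin-1 path costs the same as a plain table lookup plus one comparison.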

Alex

From:	Alex Shinn
Subject:	Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.
Date:	Sat, 26 Jan 2013 13:42:13 +0900