Re: about strings, symbols and chars.
From: Jorgen 'forcer' Schaefer
Subject: Re: about strings, symbols and chars.
Date: 24 Dec 2000 06:50:22 +0100
User-agent: Gnus/5.0807 (Gnus v5.8.7) Emacs/20.7
Dirk Herrmann <address@hidden> writes:
> On 19 Dec 2000, Jim Blandy wrote:
>
> > Certainly, scm_mb_get will be as slow as, or slower than,
> > SCM_CHAR_GET. It's there for the cases where people need simplicity
> > over performance.
>
> This sounds as if the proposal offered a faster way to access characters
> than scm_mb_get? How can that be? If, in principle, every string may
> potentially contain multi-byte characters at any position, you _have_ to
> check every character.
If I understood Jim correctly, he thinks scm_mb_next (and its
cousins like scm_mb_walk) will be faster than either SCM_CHAR_GET
or scm_mb_get, because it operates on the char array directly.
SCM_CHAR_GET has to check the type (and thus the size of the
chars) of the string on each access. On the other hand,
scm_mb_next has to check how long the next char is on each
access, so it is not much faster, if it is faster at all.
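To illustrate the per-character length check, here is a minimal
sketch of walking a multi-byte buffer directly. It assumes UTF-8 as
a stand-in for the proposed encoding, and the names mb_len_from_lead
and mb_count_chars are made up for this example; they are not the
scm_mb_* functions from the proposal.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Guile: length in bytes of a
   multi-byte sequence, decided from its lead byte alone.  UTF-8 is
   assumed here only as a stand-in encoding.  */
static size_t
mb_len_from_lead (unsigned char lead)
{
  if (lead < 0x80)
    return 1;                   /* ASCII */
  if ((lead & 0xE0) == 0xC0)
    return 2;
  if ((lead & 0xF0) == 0xE0)
    return 3;
  if ((lead & 0xF8) == 0xF0)
    return 4;
  return 1;                     /* invalid lead byte: step over one byte */
}

/* Count characters by stepping over the byte array directly.  The
   branch on the lead byte runs once per character.  */
static size_t
mb_count_chars (const unsigned char *s, size_t nbytes)
{
  size_t i = 0, n = 0;

  while (i < nbytes)
    {
      i += mb_len_from_lead (s[i]);
      n++;
    }
  return n;
}
```

The loop never consults a string type, but it still branches once
per character, which is the cost described above.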
[The following is kinda long, I guess you're aware of all of
this. I have a short, not very useful conclusion at the end,
though]
I think that the whole problem of multi-byte vs. fixed-byte
encoding is not much of a performance issue. Fixed-byte strings
are "simpler", and can be accessed randomly without performance
overhead (you could provide a macro which extracts the width of a
given string), but have problems regarding memory usage. A
single non-latin-1 character in a 4k string would make the whole
string take up 8k (8bit to 16bit expansion), while in multi-byte
it requires about 4k+1 bytes (long strings are rather uncommon in
usage, though). Multi-byte strings have problems when it comes
to setting the value of characters -- you might have to copy the
rest of the string if the new character's size differs from the
previous character's size. Fixed-width strings need only be copied
if you put in a character which needs a "bigger" encoding than you
had available before.
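The memory arithmetic can be checked with a small sketch. The helper
names are hypothetical, and the multi-byte size again assumes UTF-8
as a stand-in, where a character just outside latin-1 (U+0100, say)
takes two bytes.

```c
#include <stddef.h>

/* Bytes per character a fixed-width string needs so that its widest
   character fits: 1, 2 or 4.  Hypothetical helper for illustration.  */
static size_t
fixed_width_for (unsigned long max_char)
{
  if (max_char <= 0xFF)
    return 1;
  if (max_char <= 0xFFFF)
    return 2;
  return 4;
}

/* Total size of a fixed-width string of nchars characters.  */
static size_t
fixed_bytes (size_t nchars, unsigned long max_char)
{
  return nchars * fixed_width_for (max_char);
}

/* Size of one character in the stand-in multi-byte encoding
   (UTF-8 is an assumption, not the encoding from the proposal).  */
static size_t
mb_char_bytes (unsigned long c)
{
  if (c < 0x80)
    return 1;
  if (c < 0x800)
    return 2;
  if (c < 0x10000)
    return 3;
  return 4;
}
```

For 4096 characters of which one is U+0100, fixed_bytes (4096, 0x100)
gives 8192, while 4095 * mb_char_bytes ('a') + mb_char_bytes (0x100)
gives 4097.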
The only real disadvantage of multi-byte strings seems to me that
it's more difficult to set characters at places which had a
different width before. A more functional approach here would be
beneficial.
The disadvantages of fixed-width strings are that they can be
overly space-consuming and require the same kind of copying as the
multi-byte version, only less often. Also, they need to
differentiate between different "types" of strings.
> With a variable width encoding I see problems if threads are used: A
> thread that does a string-set! can modify the byte positions of a large
> set of characters
> [...]
> Things are not really different with fixed-width encodings: Doing a
> string-set! can require to switch a whole string from a single-byte
> representation to a two or four byte representation. But the
> recalculation of a character's position is a fast operation.
With multi-byte strings it's "calculate size difference, copy
memory region", which is even more efficient than, say, copying n
1byte locations to n 2byte locations, since the former can be
done wordwise. But the fixed-width string has to be copied only
once, while the multi-byte string has to be copied many times
over (assuming you're setting a range of chars to a different
size encoding).
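The "calculate size difference, copy memory region" step could look
roughly like this. It is only a sketch: mb_replace_at is a made-up
name, the caller is assumed to know the byte offset and length of the
old character, and the buffer is assumed to have spare capacity.

```c
#include <stddef.h>
#include <string.h>

/* Replace the old_len bytes at byte offset pos with the new_len bytes
   at repl, shifting the tail of the string by the size difference.
   buf holds *len used bytes out of cap.  Returns 0 on success, -1 if
   the range is out of bounds or the result would not fit.
   Hypothetical helper, not a Guile function.  */
static int
mb_replace_at (unsigned char *buf, size_t *len, size_t cap,
               size_t pos, size_t old_len,
               const unsigned char *repl, size_t new_len)
{
  size_t new_total;

  if (pos > *len || old_len > *len - pos)
    return -1;
  new_total = *len - old_len + new_len;
  if (new_total > cap)
    return -1;

  /* One block move for the whole tail; memmove handles the overlap.  */
  memmove (buf + pos + new_len, buf + pos + old_len,
           *len - pos - old_len);
  memcpy (buf + pos, repl, new_len);
  *len = new_total;
  return 0;
}
```

Each call moves the tail once, so setting a range of characters to a
different width one at a time repeats that move for every character.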
Fixed-width strings are faster when setting different-width chars,
which isn't required often and can be avoided.
Fixed-width strings can be easily accessed randomly, though most
of the time, strings are accessed sequentially, which is as fast
as in the multi-byte case.
Fixed-width strings require different types, and a switch on the
string's type on each access, but multi-byte requires a switch on
the first byte of the next character on each access.
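For comparison, the switch on the string's type in the fixed-width
case might look like the sketch below. The struct fw_string layout
and the name fw_ref are invented for this example and say nothing
about how SCM_CHAR_GET is actually defined.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-width string: every character occupies `width'
   bytes.  Not Guile's real string representation.  */
struct fw_string
{
  size_t nchars;
  unsigned width;               /* 1, 2 or 4 */
  const void *data;
};

/* Random access in constant time, at the price of one switch on the
   string's width per access.  */
static uint32_t
fw_ref (const struct fw_string *s, size_t i)
{
  switch (s->width)
    {
    case 1:
      return ((const uint8_t *) s->data)[i];
    case 2:
      return ((const uint16_t *) s->data)[i];
    default:
      return ((const uint32_t *) s->data)[i];
    }
}
```

The branch here depends on the string, not on the character, so it
takes the same path for every access to a given string.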
Fixed-width strings consume more memory, but this is not really
relevant since really long strings are rare, and memory is not
scarce.
Concluding, there's not much difference between the two
representations. I know this is a long mail just to say "hey,
it's not much of a difference", but I guess I had to write it.
Maybe someone can show me where I overlooked something?
Well, just my few cents...
-- jorgen
--
((email . "address@hidden") (www . "http://forcix.cx/")
(irc . "address@hidden (IRCnet)") (gpg . "1024D/028AF63C"))